AD-A148  465  COMPLEXITY  TESTABILITY  AND  FAULT  ANALYSIS  OF  DIGITAL 

ANALOG  AND  HYBRID  SY.  .  (U)  TENNESSEE  UNIV  KNOXVILLE  DEPT 
OF  ELECTRICAL  ENGINEERING  R  C  GONZALEZ  ET  AL. 
UNCLASSIFIED  20  SEP  84  TR-EE-84-50  N00014-78-C-0211  F/G  12/1 




COMPLEXITY, TESTABILITY, AND FAULT
ANALYSIS OF DIGITAL, ANALOG, AND
HYBRID SYSTEMS

Final Report

Prepared for the

Office of Naval Research
Arlington, Va. 22217


Contract No. N00014-78-C-0311
September 30, 1984
TR-EE-84-50




Investigators: 

R.C.  Gonzalez,  Electrical  Engineering  Dept.,  University  of  Tennessee, 
Knoxville,  TN  37996 

M.G.  Thomason,  Computer  Science  Dept.,  University  of  Tennessee, 
Knoxville,  TN  37996 

B.M.E.  Moret,  Computer  Science  Dept.,  University  of  New  Mexico, 
Albuquerque,  NM  87131 


UNCLASSIFIED
SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)

REPORT DOCUMENTATION PAGE

1. REPORT NUMBER: TR-EE-84-50

2. GOVT ACCESSION NO.

4. TITLE (and Subtitle): Complexity, Testability, and Fault Analysis of
Digital, Analog, and Hybrid Systems

5. TYPE OF REPORT & PERIOD COVERED: Final Report, Feb 1, 1982 - June 30, 1984

6. PERFORMING ORG. REPORT NUMBER

7. AUTHOR(s): R. C. Gonzalez, M. G. Thomason, B. M. E. Moret

8. CONTRACT OR GRANT NUMBER(s): N00014-78-C-0311

9. PERFORMING ORGANIZATION NAME AND ADDRESS: Electrical Engineering Department,
University of Tennessee, Knoxville, TN 37996

10. PROGRAM ELEMENT, PROJECT, TASK AREA & WORK UNIT NUMBERS

11. CONTROLLING OFFICE NAME AND ADDRESS: Office of Naval Research,
Arlington, VA 22217

12. REPORT DATE: September 30, 1984

13. NUMBER OF PAGES: 156

14. MONITORING AGENCY NAME & ADDRESS (if different from Controlling Office): Same as 11

15. SECURITY CLASS. (of this report): Unclassified

16. DISTRIBUTION STATEMENT (of this Report): Distribution Unlimited

17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if different from Report): Same as 16

18. SUPPLEMENTARY NOTES: None

19. KEY WORDS (Continue on reverse side if necessary and identify by block number)

Decision trees, decision diagrams, activity, boolean functions, fault trees,
complexity, Mahalanobis distance, moments, semi-invariants, linear inequalities.

20. ABSTRACT (Continue on reverse side if necessary and identify by block number)


The research described in this report is divided into two major areas: (1)
discrete  mathematical  descriptions  of  digital,  analog,  and  hybrid  systems, 
and  (2)  techniques  for  measuring  parameters  for  characterizing  certain 
aspects  of  such  systems.  New  results  are  presented  in  decision  trees,  fault 
trees,  testing  complexity,  optimal  solution  of  linear  inequalities,  and  the 
computation  of  moments  and  semi-invariants  of  the  interclass  Mahalanobis 
distance. 



TABLE OF CONTENTS


Section I. SUMMARY

Introduction
Decision Trees
Fault Trees
Testing Complexity
Optimal Solution of Linear Inequalities
Moments of the Interclass Mahalanobis Distance
Semi-Invariants of the Interclass Mahalanobis Distance


Section II. DECISION TREES

The Activity of a Variable and Its Relation to Decision Trees
Decision Trees and Diagrams
The Use of Activity in Testing Digital and Analog Systems
Optimization Criteria for Decision Trees
Symmetric and Threshold Boolean Functions are Exhaustive


Section III. FAULT TREES

Boolean Difference Techniques for Time-Sequence and Common
Cause Analysis of Fault-Trees


Section IV. TESTING COMPLEXITY

On Minimizing a Set of Tests


Section V. OPTIMAL SOLUTION OF LINEAR INEQUALITIES

Optimal Solution of Linear Inequalities with Applications
to Pattern Recognition


Section VI. MOMENTS OF THE INTERCLASS MAHALANOBIS DISTANCE

Moments of the Interclass Mahalanobis Distance


Section VII. SEMI-INVARIANTS OF THE INTERCLASS MAHALANOBIS DISTANCE

Semi-Invariants of the Interclass Mahalanobis Distance



SECTION  I 
SUMMARY 




1. INTRODUCTION


This is the final report on the work in "complexity,
testability, and fault analysis of digital, analog, and
hybrid systems" carried out on ONR Contract
N00014-78-C-0311. This work was performed by Drs. R. C.
Gonzalez and M. G. Thomason at the University of
Tennessee, Knoxville, and by Dr. B.M.E. Moret initially at
the University of Tennessee, Knoxville, and subsequently at
the University of New Mexico, Albuquerque. Other
individuals were also involved for short periods of time.
The research has produced significant theoretical and
practical results which have appeared in technical journals
and technical reports.

The research was divided into two major areas: discrete
mathematical descriptions of aspects of digital, analog, and
hybrid systems useful in the study of complexity and fault
analysis; and techniques for measuring parameters to
characterize certain aspects of such systems. In this
summary section, we give an overview of the various results.
Sections II through VII contain compilations of the articles
and reports resulting from this work. The material in these
sections is organized in the same order as the discussion in
this summary section.


2. DECISION TREES


Early in this work, decision trees and equivalent
expressions were adopted as the discrete mathematical
representation of functions for detailed study. There is a
one-to-one correspondence between a tree for a discrete
function and an expression for the same function; hence,
one can select the representational form which is better
suited to the manipulation required in any specific case.
For instance, a fault tree for a digital, analog, or hybrid
system is a concept widely used to represent the
interconnections of subsystems as a directed graph which
clearly illustrates the hierarchical decomposition into
major subsystems, then minor subsystems, then individual
components; but the equivalent fault expression is often
easier to manipulate when one wants to determine the
criticality of a subsystem or estimate the total system's
reliability as based on subsystem or component-level
calculations.

The initial work on decision trees was carried out as Dr.
Moret's PhD research at the University of Tennessee and has
continued with a focus on the area of fault trees. The
major results are these four contributions:


i) a generalization of decision trees to simple
recursive functions through a process of composition which
allows functional as well as hierarchical decomposition of
systems, including systems with feedback;


ii) a characterization of the complexity of testing
certain classes of Boolean functions, which has implications
in logic design and programming;


iii) a study of the "activity" of a variable as a
generalization of Chow parameters with close connections to
Boolean differences, which is a useful tool in assessing
subsystem importance and designing test sets;


iv) an extension of Boolean difference techniques to
the analysis of time-dependent systems with applications to
common-cause analysis.


Decision trees are a natural model of the sequential
evaluation of discrete functions where, at each node, a
variable is evaluated and a decision (to output the
functional value or to look at another variable) is made.
Such a model is effective for Boolean functions as well as
more general, multivalued functions because it is a compact
representation with an inherent ordering of variables for
evaluation.

Since the number of tree forms for a given discrete function
has an exponential dependence on the number of intrinsic
variables, the complexity of optimizing decision trees with
respect to several criteria was examined in detail and
reported in Moret [1980] and Moret et al. [1981a, 1981b].
A specific measure on discrete functions, called the
activity of a variable, was defined and shown to be closely
related to the evaluation cost of decision trees for the
function [Moret et al. 1980]. The activity of a variable
is a generalization of concepts developed in the framework
of Boolean functions, such as Chow parameters and Boolean
differences (cf. Moret et al. [1980]), all of which are
valuable analytical tools in studies of the importance of
subsystems in the overall system operation and the
susceptibility of total system failure to the failures of
individual subsystems.

The activity of a variable also finds application in system
testing. In particular, exercising those variables having
the highest activities maximizes the probability of error
detection in systems with equally likely faults. Moreover,
the concept can be extended to sequential functions by
considering the long-term, steady-state distribution of
system states, so that tests can be designed to reflect
average or normal operational modes. Finally, the activity
is similar to previously used measures of subsystem
criticality or importance; however, the activity measure
readily generalizes to multi-valued models, a significant
advantage where analog and hybrid systems are concerned.

Some results were obtained for proper subsets of the Boolean
functions. It was shown that all symmetric and threshold
Boolean functions have worst-case (i.e., total variable)
testing complexity. Since these functions are commonly
encountered in fault modeling, logic design, and pattern
recognition, this result provides information useful in
these fields. The result appears in Moret et al. [1983].

It should be noted that adopting decision trees rather than
more conventional forms of discrete functions led to a
unified framework in which several previously disconnected
results were seen to fit together. Dr. Moret's survey
article [Moret, 1982] has been used by practitioners in a
variety of fields, including engineering and scientific
disciplines, and has been cited frequently.


3.  FAULT  TREES 


The most recent work has concentrated on fault trees as a
special usage of decision trees as system models with
emphasis on reliability and testing. A digital, analog, or
hybrid system is modelled as a configuration of basic
components, each of which is either working or faulty (as
defined by the value of a Boolean state variable); the
configuration, in turn, is described by higher-level
subsystem functions, and ultimately by the overall system
function (the value of which is the "top event" state in
that it indicates "system working" or "system failed"). A
fault tree itself is a logic-operation realization of this
system function.

Such trees are widely used in areas in which very complex
systems must be analyzed, for example, in the aircraft and
nuclear power industries. However, as usually developed,
fault trees do not account for time-dependent system
reconfigurations or for non-binary component behavior (e.g.,
partially working and satisfactory for some but not all
configurations). In order to extend the applicability of
the fault tree concept, Moret and Thomason have extended an
idea of Thomason and Page [1976] in using time-dependent
Boolean differences for analysis of sequential fault
functions for systems which undergo a reconfiguration at
discrete points in time. Initial work on the inclusion of
probabilities was also performed so that estimates of
long-run failure probabilities could be calculated for
appropriate assumptions of steady-state conditions and
independence of failure events.

This  method  also  allows  a  study  of  arbitrary 
subconfigurations  in  the  total  system.  A  characterization 
of  minimum  and  maximum  test  conditions  has  been  developed 
for  the  sensitization  of  the  system  to  an  arbitrary 
combination  of  events  in  the  subsystems.  It  is  shown  that 
some  fundamental  results  in  stochastic  process  theory  can  be 
applied  to  time-dependent  systems  with  suitable  transition 
probabilities.  These  results  provide  a  basis  for  the 
qualitative  and  quantitative  analysis  of  "common-causes"  of 
simultaneous  failures  in  several  subsystems. 
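
As a concrete illustration of the Boolean difference underlying this analysis, the following minimal sketch computes the ordinary (static) Boolean difference of a small fault function and lists the component states in which the system is sensitized to one component. The three-component system `system_failed` and all function names are hypothetical, chosen only for illustration; the time-dependent extension described above is not modelled here.

```python
from itertools import product

def system_failed(x):
    """Hypothetical top event: the system fails if component a fails,
    or if both b and c fail (1 = faulty, 0 = working)."""
    a, b, c = x
    return a | (b & c)

def boolean_difference(f, i):
    """Return df/dx_i: a function that is 1 exactly when flipping x_i
    changes the value of f (x_i itself is ignored in the argument)."""
    def diff(x):
        low = x[:i] + (0,) + x[i + 1:]
        high = x[:i] + (1,) + x[i + 1:]
        return f(low) ^ f(high)
    return diff

def sensitizing_states(f, i, n):
    """List the states (with x_i set to 0) in which f is sensitized to x_i."""
    diff = boolean_difference(f, i)
    return [x for x in product((0, 1), repeat=n) if x[i] == 0 and diff(x)]

# Component b (index 1) matters only when a works and c has failed.
print(sensitizing_states(system_failed, 1, 3))
# Component a (index 0) matters unless b and c have both failed.
print(sensitizing_states(system_failed, 0, 3))
```

The number of sensitizing states gives a rough, unweighted indication of how critical a component is; weighting the states by their probabilities leads toward the activity measure discussed in Section 2.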


4.  TESTING  COMPLEXITY 


The problem of developing test sets can be considered in two
stages: a stage in which potential tests are designed and
their results measured, and a stage in which the final set
of tests is selected. This second stage involves the
optimization of some criterion function and often reduces to
selecting the smallest possible number of tests in the final
collection, i.e., the "minimum test set problem." This
computationally intractable problem is NP-hard, as a result
of which many researchers have worked on suboptimal
strategies in the form of heuristic search methods.
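
To make the flavor of such suboptimal strategies concrete, here is a minimal sketch of a greedy selection heuristic, under the simplifying assumption that each candidate test detects a known set of faults and that every fault must be detected by at least one chosen test. The greedy rule is a standard covering heuristic and is offered only as an illustration; it is not claimed to be one of the algorithms evaluated under this contract, and the test and fault names are hypothetical.

```python
def greedy_test_set(detects):
    """Pick tests one at a time, always taking the test that detects the
    most still-undetected faults. `detects` maps test name -> set of faults."""
    uncovered = set().union(*detects.values())
    chosen = []
    while uncovered:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(detects), key=lambda t: len(detects[t] & uncovered))
        chosen.append(best)
        uncovered -= detects[best]
    return chosen

# Hypothetical candidate tests and the faults each one detects.
candidates = {
    "t1": {1, 2, 3},
    "t2": {3, 4},
    "t3": {4, 5},
    "t4": {1, 5},
}
print(greedy_test_set(candidates))
```

The heuristic runs in time polynomial in the number of tests and faults, but in general it carries no guarantee of returning a smallest test set.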

The  objective  of  the  study  on  this  contract  was  a 
theoretical  and  practical  evaluation  of  various  suboptimal 
strategies.  Several  minimization  routines  were  run  under 
different  conditions  for  comparisons  of  their  growth  in 
complexity  predicted  by  theory  with  the  difficulties 
actually  encountered  in  solving  real  problems  (Moret  and 
Shapiro  [1982]). 

Characterizing several aspects of the suboptimal algorithms
required extensive experimentation. The outcome of over
three thousand test runs brought to light two encouraging
results. First, despite the theoretical prediction of
sharply increasing complexity, the coherence or
"individuality" of the real-world problems caused their
complexity to increase only slowly with size. Second, the
experiments clearly showed that one of the algorithms was
far superior to the others for the range of problems
considered; this was in agreement with theoretical
predictions based on bounding methods that were developed in
the course of this contract.

Overall, this effort has contributed to a much better
understanding of the "minimum test set problem" and its
various suboptimal solutions. In particular, the best
existing algorithm has been identified and characterized.
Well supported by intuition and empirical evidence, an exact
characterization of the best behavior of such algorithms has
been conjectured but as yet not proved; should it prove
true, the existing algorithm would in fact be as good as
can be achieved.


5. OPTIMAL SOLUTION OF LINEAR INEQUALITIES


Linear inequalities are applicable in digital systems in the
areas of threshold functions and pattern recognition.
Although the solution of consistent inequalities is
straightforward (e.g., by linear programming), relatively
little is known about the solution of inconsistent
inequalities. The first practical algorithm in this area
was reported by Warmack and Gonzalez in 1973. These results
were generalized by Clark and Gonzalez [1981] as part of the
work on this contract. The Clark-Gonzalez algorithm is a
nonenumerative procedure guaranteed to find all optimal
solutions to a set of inconsistent inequalities. (Finding
the solutions of consistent inequalities is a special case
of this method.) Bounds on the search carried out by the
algorithm were developed, and the method was shown to be
computationally superior to other methods (including the
Warmack-Gonzalez algorithm) for finding minimum-error
solutions.


6. MOMENTS OF THE INTERCLASS MAHALANOBIS DISTANCE


The  Mahalanobis  distance  is  a  measure  of  similarity  between 
multivariate  Gaussian  populations.  In  terms  of  the  work  in 
this  contract,  the  Mahalanobis  distance  offers  a  robust 
descriptor  for  characterizing  multivariate  measurements 
performed in an analog system. In this context, the problem
can  be  formulated  as  a  pattern  recognition  task  whose 
objective  is  to  detect  deviations  from  a  normal  mode  of 
operation. 
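
For readers unfamiliar with the measure, the sketch below computes the classical squared Mahalanobis distance of a two-dimensional measurement from a class described by a mean vector and a 2x2 covariance matrix, and uses it to flag a deviation from normal operation. The numerical values, the threshold, and the function names are hypothetical; this is the textbook point-to-class form, not the interclass random-variable formulation analyzed in the report.

```python
def mahalanobis_squared(x, mean, cov):
    """Squared Mahalanobis distance (x - mean)^T cov^{-1} (x - mean)
    for 2-D data, using the closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # cov^{-1} = (1/det) * [[d, -b], [-c, a]]
    return (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det

# Hypothetical description of the normal operating mode.
normal_mean = (0.0, 0.0)
normal_cov = ((4.0, 0.0), (0.0, 1.0))
THRESHOLD = 9.0  # assumed decision threshold on the squared distance

def is_abnormal(measurement):
    return mahalanobis_squared(measurement, normal_mean, normal_cov) > THRESHOLD

print(mahalanobis_squared((2.0, 1.0), normal_mean, normal_cov))
print(is_abnormal((2.0, 1.0)), is_abnormal((0.0, 4.0)))
```

Because the distance is scaled by the covariance, a deviation of 2 units along the high-variance axis counts for no more than a deviation of 1 unit along the low-variance axis.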

When treated as a random variable, the Mahalanobis distance
has a probability density function (PDF) that can be related
to the probability of error in classification (e.g.,
classification of normal vs. abnormal operation). When the
covariance matrices are equal, obtaining this PDF is
straightforward; however, the more general (and practical)
case involving unequal covariance matrices requires
complicated numerical integration techniques for determining
the PDF.

In many applications of multivariate data description, it is
of interest to compute the moments of the Mahalanobis
distance without having to estimate its underlying PDF as an
intermediate step. In a recent paper (Gonzalez and Wagner
[1983]) it was shown that the moments of the interclass
Mahalanobis distance between two multivariate groups of data
(also called classes) can be expressed in a simple
polynomial form. The nth moment is expressible as a
polynomial of order n whose variable depends upon the mean
vectors and eigenvalues of the covariance matrices of the
two populations. A closed form solution is also given for
computing the coefficients of the expressions. The relative
simplicity of these results has important implications in
terms of implementation in a digital computer or dedicated
hardware.


7. SEMI-INVARIANTS OF THE INTERCLASS MAHALANOBIS DISTANCE


An  alternative  to  the  technique  discussed  in  the  previous 
section  is  to  compute  the  semi-invariants  (which  do  not 
require  that  the  eigenvectors  be  known)  and  then  obtain  the 
moments  from  the  semi-invariants.  A  new  approach  for 
obtaining  the  semi-invariants  was  recently  reported  by 
Gonzalez and Wagner [1984]. The semi-invariants are given
directly  in  terms  of  the  mean  vectors  and  inverse  covariance 
matrices.  It  is  well  known  that  the  moments  and 
semi-invariants  are  related  by  expressions  which,  though 
theoretically  simple,  are  quite  inefficient  in  terms  of 
computation.  A  new,  iterative  algorithm  that  is  easily 
implemented  on  a  computer  was  also  reported  in  the  same 
paper.


The construction of sequential testing procedures from functions of discrete arguments is a common
problem in switching theory, software engineering, pattern recognition, and management. The concept
of the activity of an argument is introduced, and a theorem is proved which relates it to the expected
testing cost of the most general type of decision trees. This result is then extended to trees constructed
from relations on finite sets and to decision procedures with cycles. These results are used, in turn, as
the basis for a fast heuristic selection rule for constructing testing procedures. Finally, some bounds
on the performance of the selection rule are developed.

1. INTRODUCTION
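
The relation between moments and semi-invariants (cumulants) mentioned above can be sketched with the standard recursion m_n = sum over k = 1..n of C(n-1, k-1) * kappa_k * m_(n-k), with m_0 = 1. The code below is a minimal illustration of that textbook recursion; it is not claimed to be the iterative algorithm of Gonzalez and Wagner [1984], and the function name and test values are assumptions made for the example.

```python
from math import comb

def moments_from_cumulants(kappa):
    """Given cumulants kappa[0] = kappa_1, ..., kappa[n-1] = kappa_n,
    return the raw moments [m_1, ..., m_n] via the standard recursion."""
    m = [1.0]  # m_0 = 1
    for n in range(1, len(kappa) + 1):
        m_n = sum(comb(n - 1, k - 1) * kappa[k - 1] * m[n - k]
                  for k in range(1, n + 1))
        m.append(m_n)
    return m[1:]

# Sanity check on a distribution whose cumulants are all equal
# (a Poisson variable with rate 2 has every cumulant equal to 2).
print(moments_from_cumulants([2, 2, 2, 2]))
```

Each new moment reuses all previously computed moments, so the first n moments cost on the order of n^2 multiplications rather than the combinatorial effort of expanding the closed-form partition sums.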

A common problem in switching theory, software engineering, pattern recognition,
and management is the construction of sequential testing procedures (also
called decision trees or decision programs) from a given function of discrete
arguments [1, 5, 9, 11, 13, 16, 20]. The problem is to select from the numerous
available trees one which is an optimal tree representation with respect to some
criterion. In particular, it is often desired to select a tree which has the smallest
expected testing cost, that is, a tree such that the average cost of determining a
value of the function (by testing some of the variables) is minimal. Variants of
this problem have been studied by many researchers, who have provided search
algorithms to find the optimal tree(s) [4, 7, 10, 12, 15] or proposed heuristic rules
for constructing suboptimal trees [3, 5, 6, 14, 17, 19].

In this paper we introduce the concept of activity of a variable and prove a
theorem relating it to the expected testing cost of decision trees with costs and
probabilities. This result is then extended to trees constructed from relations on
finite sets and to decision procedures with cycles (corresponding to recursive
functions). This provides the basis for a fast heuristic selection rule, which is a
generalization of criteria proposed in [4] and [15]. Finally, we examine certain
conditions under which the rule performs optimally and give some bounds on its
behavior.

2.  PRELIMINARIES 

We are given a (partial) function of $n$ discrete-valued variables, $f(x_1, \ldots, x_n)$;
each variable $x_i$ can take on exactly $m_i$ values, $m_i > 1$, and the determination of
its value incurs cost $c_i$. A discrete probability distribution is also specified on the
$\prod_{i=1}^{n} m_i$ points of the variables' space (combinations of variable values which do
not belong to the inverse image of $f$ do not necessarily have zero probability); the
probability of a point is denoted $p(x_1, \ldots, x_n)$. It is noted that the probability
$p(x_i = k)$ that variable $x_i$ will take on value $k$ can be computed by

\[
p(x_i = k) = \sum_{x_1=1}^{m_1} \cdots \sum_{x_{i-1}=1}^{m_{i-1}} \sum_{x_{i+1}=1}^{m_{i+1}} \cdots \sum_{x_n=1}^{m_n} p(x_1, \ldots, x_{i-1}, k, x_{i+1}, \ldots, x_n). \tag{1}
\]

Definition 1. If $f(x_1, \ldots, x_n) = \text{constant}$, then the decision tree for $f$ is a leaf
labeled "constant"; otherwise, for each $x_i$, $f$ has decision tree(s) composed of a
root labeled "$x_i$" and $m_i$ decision subtrees, corresponding to the $m_i$ subfunctions
$f|_{x_i = k}$, $1 \le k \le m_i$.

If variables $x_{k_1}, \ldots, x_{k_{n_k}}$, in that order, are tested along path $P_k$, yielding values
$v_1, \ldots, v_{n_k}$ and leading to leaf $a_k$, then the probability of reaching leaf $a_k$ is the
sum of the probabilities of all combinations of variable values leading to that leaf.
Using (1) above, this can be written as

\[
p(a_k) = \prod_{i=1}^{n_k} p(x_{k_i} = v_i \mid x_{k_1} = v_1, \ldots, x_{k_{i-1}} = v_{i-1}).
\]

In following path $P_k$, we test $n_k$ variables for a total cost of

\[
C(P_k) = \sum_{i=1}^{n_k} c_{k_i}.
\]

Thus the expected testing cost of the tree $T$ is the quantity

\[
C(T) = \sum_{k} p(a_k) \cdot C(P_k),
\]

where the sum is taken over all leaves $a_k$ of $T$.

Any  internal  node  of  a  decision  tree  T  is  associated  with  a  subfunction  of  T. 
That  subfunction  itself  has  a  probability  which  is  the  sum  of  the  probabilities  of 
the  combinations  of  variable  values  included  in  the  subfunction.  This  is  equal  to 
the  probability  of  reaching  the  said  internal  node  or,  equivalently,  to  the  sum  of 
the  probabilities  of  the  leaves  of  the  subtree  rooted  at  that  internal  node.  In  the 
following  section  we  shall  be  interested  in  the  subfunction  resulting  from  a 
combination of $n - 1$ values, that is, the case in which the values of all variables
but, say, $x_i$ are fixed, resulting in the selection of an $m_i$-tuple of possible
combinations, one for each of the $m_i$ values of the unspecified variable $x_i$.

Fig. 1. A sample decision tree for Example 1.

Example 1. Let $f$ be a partial function of three binary variables, $f : \{0, 1\}^3 \to \{1, 2, 3\}$,
given by

\[
\begin{array}{ll}
(0, 0, 0) \mapsto 3, & (1, 0, 0) \mapsto 2, \\
(0, 1, 0) \mapsto 1, & (1, 0, 1) \mapsto 2, \\
(0, 1, 1) \mapsto 3, & (1, 1, 0) \mapsto 1.
\end{array}
\]

The costs are $c_1 = 0.5$, $c_2 = 0.68$, and $c_3 = 0.25$, and the probability distribution is
specified by

\[
\begin{array}{ll}
p(0, 0, 0) = 0.10, & p(1, 0, 0) = 0.25, \\
p(0, 0, 1) = 0.05, & p(1, 0, 1) = 0.05, \\
p(0, 1, 0) = 0.20, & p(1, 1, 0) = 0.15, \\
p(0, 1, 1) = 0.20, & p(1, 1, 1) = 0.00.
\end{array}
\]

A possible decision tree for this function is illustrated in Figure 1, together with
the probabilities of the leaves. The expected testing cost of that tree is

\[
\begin{aligned}
C(T) &= 0.1 \cdot (0.5 + 0.25 + 0.68) + 0.2 \cdot (0.5 + 0.25 + 0.68) \\
     &\quad + 0.25 \cdot (0.5 + 0.25) + 0.3 \cdot (0.5 + 0.68) + 0.15 \cdot (0.5 + 0.68) = 1.1475. \qquad \Box
\end{aligned}
\]
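
The computation in Example 1 can be reproduced mechanically. The following minimal sketch encodes the costs and probabilities of Example 1 and a decision tree whose shape is an assumption: it is inferred from the cost expression for C(T) (root x1; on x1 = 0 test x3, then x2 if x3 = 0; on x1 = 1 test x2), since the figure itself is not reproduced in this text. The tuple encoding and function names are illustrative only.

```python
# Costs c_i and point probabilities p(x1, x2, x3) from Example 1.
COST = {1: 0.5, 2: 0.68, 3: 0.25}
PROB = {
    (0, 0, 0): 0.10, (1, 0, 0): 0.25,
    (0, 0, 1): 0.05, (1, 0, 1): 0.05,
    (0, 1, 0): 0.20, (1, 1, 0): 0.15,
    (0, 1, 1): 0.20, (1, 1, 1): 0.00,
}

# Internal node: (variable index, {value: subtree}); leaf: ("leaf", value of f).
# Assumed shape, inferred from the cost expression in Example 1.
TREE = (1, {
    0: (3, {0: (2, {0: ("leaf", 3), 1: ("leaf", 1)}),
            1: ("leaf", 3)}),
    1: (2, {0: ("leaf", 2), 1: ("leaf", 1)}),
})

def point_mass(fixed):
    """Total probability of the points that agree with the tested values."""
    return sum(p for point, p in PROB.items()
               if all(point[var - 1] == val for var, val in fixed.items()))

def expected_cost(tree, fixed=None, path_cost=0.0):
    """C(T): sum over leaves of (probability of leaf) * (cost of its path)."""
    fixed = fixed or {}
    label, payload = tree
    if label == "leaf":
        return point_mass(fixed) * path_cost
    return sum(expected_cost(sub, {**fixed, label: value}, path_cost + COST[label])
               for value, sub in payload.items())

print(round(expected_cost(TREE), 4))
```

Note that each leaf probability is obtained by summing the joint probabilities of the points that reach the leaf, which is how the leaf values 0.1, 0.2, 0.25, 0.3, and 0.15 in the cost expression arise.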

3.  THE  ACTIVITY  OF  A  VARIABLE 

Considering the $m_i$-tuple of combinations mentioned at the end of Section 2, we
distinguish two cases:

(i) two of the $m_i$ combinations are mapped to distinct values by $f$;

(ii) no such two combinations can be found.

In the first case, variable $x_i$ must be tested in order to distinguish all values of the
function; in the second case, this is not necessary, although it may be done in a
particular tree, either as a redundant test or because at least one combination did not
belong to the inverse image of $f$ and has been arbitrarily mapped to a value
distinct from the image of the other combinations. Thus, the a priori probability
$p_f^*(x_i)$ that variable $x_i$ will be needed in testing all the values of $f$ (i.e., the
probability that $f$ will be sensitized to $x_i$) is equal to the sum of the probabilities
of all the $m_i$-tuples satisfying case (i) above; conversely, the a priori probability
$\bar{p}_f^*(x_i)$ that $x_i$ will be useless is equal to the sum of the probabilities of the
remaining $m_i$-tuples, those satisfying case (ii).

The same reasoning is easily adapted to a subfunction $f$ by normalizing the
probabilities with the probability of $f$. For any $x_i$ and $f$, $p_f^*(x_i) + \bar{p}_f^*(x_i) = 1$.

Definition 2. The activity of variable $x_i$ with respect to subfunction $f$ is
defined as the quantity

\[
a_f(x_i) = c_i \cdot p_f^*(x_i).
\]

Definition 3. The loss of variable $x_i$ with respect to subfunction $f$ is defined
as the quantity

\[
l_f(x_i) = c_i - a_f(x_i).
\]

The activity of a variable is a measure of how much influence a variable has on
the determination of a function's values. A related concept, known as Chow
parameter [18], is discussed in [2] and [4]; when all costs are unity and all variable
combinations equally likely, the activity of a variable with respect to a completely
specified Boolean function of $n$ variables reduces to the Chow parameter of the
variable divided by $2^{n-1}$.

The loss of a variable $x$ is a measure of the wasted decision power associated
with the choice of $x$ as the root of the decision tree. This is intuitively obvious,
since such a choice results in testing the variable with probability 1, while the a
priori probability of needing $x$ was $p_f^*(x)$.

Example 2. The various quantities defined above are computed for the
function of Example 1 and are listed below.

\[
\begin{array}{lll}
p_f^*(x_1) = 0.35, & a_f(x_1) = 0.175, & l_f(x_1) = 0.325, \\
p_f^*(x_2) = 0.7,  & a_f(x_2) = 0.476, & l_f(x_2) = 0.204, \\
p_f^*(x_3) = 0.4,  & a_f(x_3) = 0.1,   & l_f(x_3) = 0.15.
\end{array}
\]

□
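
The values of Example 2 can be checked with a short script. The sketch below enumerates, for each variable, the pairs of points that differ only in that variable and sums the probabilities of the pairs on which f takes two distinct specified values. It assumes that a pair containing an unspecified point falls under case (ii), an interpretation that reproduces the numbers listed above; the function names are illustrative only.

```python
from itertools import product

VARS = (1, 2, 3)
COST = {1: 0.5, 2: 0.68, 3: 0.25}
# Partial function f of Example 1 (two of the eight points are unspecified).
F = {(0, 0, 0): 3, (1, 0, 0): 2, (0, 1, 0): 1,
     (1, 0, 1): 2, (0, 1, 1): 3, (1, 1, 0): 1}
PROB = {
    (0, 0, 0): 0.10, (1, 0, 0): 0.25,
    (0, 0, 1): 0.05, (1, 0, 1): 0.05,
    (0, 1, 0): 0.20, (1, 1, 0): 0.15,
    (0, 1, 1): 0.20, (1, 1, 1): 0.00,
}

def need_probability(i):
    """p*_f(x_i): probability mass of the tuples satisfying case (i)."""
    others = [v for v in VARS if v != i]
    total = 0.0
    for rest in product((0, 1), repeat=len(others)):
        fixed = dict(zip(others, rest))
        # The two points obtained by letting x_i range over {0, 1}.
        points = [tuple(k if v == i else fixed[v] for v in VARS) for k in (0, 1)]
        values = {F[p] for p in points if p in F}  # unspecified points ignored
        if len(values) > 1:
            total += sum(PROB[p] for p in points)
    return total

def activity(i):
    return COST[i] * need_probability(i)

def loss(i):
    return COST[i] - activity(i)

for i in VARS:
    print(i, round(need_probability(i), 3), round(activity(i), 3), round(loss(i), 3))
```

Running the loop prints one row per variable in the order of the table in Example 2.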

The following theorem establishes the relationship between activity, loss, and
expected testing cost of decision trees. The proof technique is derived from [4],
where a simplified version of this theorem using Chow parameters was proved for
completely specified monotone Boolean functions of uniformly distributed
variables with unity costs.

Theorem 1. The expected testing cost $C(T)$ of a decision tree $T$ for the
function $f(x_1, \ldots, x_n)$ can be expressed as

\[
C(T) = \sum_{i=1}^{n} a_f(x_i) + \sum_{k} p(f_k) \cdot l_{f_k}(\beta_k), \tag{2}
\]

where the second sum is taken over all internal nodes $\beta_k$ and $f_k$ refers to the
subfunction associated with $\beta_k$.

Remark.  This  theorem  says  that  the  expected  testing  cost  of  a  decision  tree 
is  composed  of  a  fixed  “overhead”  (the  first  sum)  and  a  variable  amount  of  “loss” 
(the  second  sum)  which  depends  on  the  structure  of  the  tree. 

Proof. The proof is by induction on $n$, the number of variables. For $n = 1$,
the basis is easily verified: the variable space is just an $m$-tuple, and there are
only two possible tree structures. Assume that the theorem holds for all functions
of up to and including $n - 1$ variables, and let $f$ be a function of $n$ variables.
Choose $x_i$ to be the root of $T$. This determines $m_i$ subfunctions, each of $n - 1$
variables, so that the inductive hypothesis applies and for each subfunction $f_j$,
$j = 1, \ldots, m_i$, we have

\[
C(T_j) = \sum_{k \ne i} a_{f_j}(x_k) + \sum_{k} p(f_k) \cdot l_{f_k}(\beta_k),
\]

where the second sum is taken over all internal nodes $\beta_k$ of $T_j$ (with probabilities
normalized by $p(f_j)$). But $C(T) = c_i + \sum_{j=1}^{m_i} p(f_j) \cdot C(T_j)$, and after
substituting and simplifying, we obtain

\[
C(T) = c_i - l_f(x_i) + \sum_{j=1}^{m_i} \Bigl( p(f_j) \cdot \sum_{k \ne i} a_{f_j}(x_k) \Bigr) + \sum_{k} p(f_k) \cdot l_{f_k}(\beta_k), \tag{3}
\]

where the last sum is taken over all internal nodes $\beta_k$ of $T$. But we know that

\[
c_i - l_f(x_i) = a_f(x_i)
\]

and

\[
\sum_{j=1}^{n} a_f(x_j) = a_f(x_i) + \sum_{j=1}^{m_i} \Bigl( p(f_j) \cdot \sum_{k \ne i} a_{f_j}(x_k) \Bigr).
\]

Substitution of these two equalities in (3) yields

\[
C(T) = \sum_{j=1}^{n} a_f(x_j) + \sum_{k} p(f_k) \cdot l_{f_k}(\beta_k),
\]

where the second sum is taken over all internal nodes $\beta_k$ of $T$. □

Corollary 1. The expected testing cost of any decision tree T for the function $f(x_1, \ldots, x_n)$ having $x_i$ as root is bounded by

$$\sum_{j=1}^{n} c_j \geq C(T) \geq l_f(x_i) + \sum_{j=1}^{n} a_f(x_j).$$

This corollary, in simplified form, was proved in [15] and is implicit in [4]. Both
references  use  it  as  the  basis  for  a  branch-and-bound  search  algorithm  to  find  a 
tree  that  is  optimal  with  respect  to  the  expected  testing  cost. 

These  results  stress  the  importance  of  the  sum  of  the  activities  of  the  variables 
of  a  function  as  a  representation-independent  measure  of  the  cost  incurred  in 
determining  the  values  of  that  function.  This  motivates  the  following  definition. 

Definition 4. The intrinsic cost I(f) of the function $f(x_1, \ldots, x_n)$ is defined as the quantity

$$I(f) = \sum_{i=1}^{n} a_f(x_i).$$


Example 3. Using the values of activity computed in Example 2 for the function of Example 1, we obtain the intrinsic cost of f:

$$I(f) = a_f(x_1) + a_f(x_2) + a_f(x_3) = 0.175 + 0.476 + 0.1 = 0.751.$$

Fig. 2. The tree of Figure 1 with node losses and probabilities.

For the tree of Figure 1 we compute the loss and probability of each internal node to obtain the values shown in Figure 2. The sum of these losses, weighted by the node probabilities, and of the intrinsic cost is

$$(0.325 \cdot 1) + (0.681 \cdot 0.55) + (0.075 \cdot 0.45) + (0.0 \cdot 0.3) + 0.751 = 1.1475,$$

the computed expected testing cost of the tree. Corollary 1 indicates that any tree for f having $x_1$ as root will have a minimum cost of 0.751 + 0.325 = 1.076; similarly, any tree with root $x_2$ will have a minimum cost of 0.751 + 0.204 = 0.955, while trees rooted in $x_3$ have a lower bound of 0.751 + 0.15 = 0.901. □
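
The lower bounds quoted in Example 3 can be replayed in a few lines. The sketch below is illustrative only: the function name `root_lower_bounds` is hypothetical, and the activities and losses are simply the figures quoted in the example.

```python
def root_lower_bounds(activities, losses):
    """Corollary 1: a tree rooted in x_i costs at least I(f) + l_f(x_i)."""
    intrinsic = sum(activities.values())  # I(f), the sum of the activities
    bounds = {x: intrinsic + loss for x, loss in losses.items()}
    return intrinsic, bounds


# Figures quoted in Example 3 (not recomputed from the function itself).
activities = {"x1": 0.175, "x2": 0.476, "x3": 0.1}
losses = {"x1": 0.325, "x2": 0.204, "x3": 0.15}

intrinsic, bounds = root_lower_bounds(activities, losses)
print(round(intrinsic, 3))
print({x: round(b, 3) for x, b in bounds.items()})
```

The smallest of the three bounds is the one a branch-and-bound search would use as its initial estimate of the optimal cost.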

4.  EXTENSION  TO  RELATIONS 

We  extend  the  definitions  of  activity  and  loss  to  relations  on  finite  sets.  This  is  of 
particular  interest  in  the  case  of  interdependent  functions  which  must  be  repre¬ 
sented  by  a  single  tree  (as  in  [5]). 

A  relation  R  might  specify  no  more  than  one  output  for  each  input  combination, 
in  which  case  it  is  a  (partial)  function.  R  may,  however,  specify  more  than  one 
output,  in  which  case  we  assume  that  we  can  arbitrarily  decide  to  specify  any 
particular  output  or  leave  the  choice  open.  It  is  also  assumed  that  an  unspecified 
entry  (a  "don’t  care")  is  in  fact  related  to  the  whole  output  set,  so  that  any  one 
output  value  can  be  selected  for  such  input  combination.  We  then  extend  the 
definitions  of  activity,  loss,  and  intrinsic  cost  in  the  obvious  way  by  noting  that 
a variable is needed to differentiate the values of an $m_i$-tuple if and only if the
intersection of the output sets specified by R for the $m_i$ components is empty. It
is  readily  verified  that  all  results  previously  stated  for  partial  functions  remain 
valid  for  relations. 

Example 4. Consider the relation R from the input set $\{0, 1\}^2 \times \{0, 1, 2\}$ to the output set $\mathcal{R} = \{a, b, c, d\}$, where all three variables have unity cost and the relation and the probability distribution are given in Figure 3. Since all variables have unity costs, $p_f(x_i) = a_f(x_i)$, so that $a_f(x_1) = 0.35$, $a_f(x_2) = 0.2$, and $a_f(x_3) = 0.6$. The intrinsic cost of R is I(R) = 0.35 + 0.2 + 0.6 = 1.15. Choosing $x_3$ as the root for a decision tree results in a lower bound on the cost of 1.15 + (1 - 0.6) = 1.55. A possible decision tree T rooted in $x_3$ is shown in Figure 4, together with the losses of its nodes. Its expected testing cost is $C(T) = 1.15 + (0.35 \cdot \ldots) = 1.75$, and it is in fact one of the optimal trees for R. □

Fig. 3. The relation and its probability distribution for Example 4.

Fig. 4. A decision tree with node losses for Example 4.

5.  EXTENSION  TO  RECURSIVE  FUNCTIONS  AND  RELATIONS 

As  a  further  extension  of  the  foregoing  concepts,  we  consider  the  case  of  a 
recursive  function  or  relation,  that  is,  a  relation  which,  for  certain  input  combi¬ 
nations,  does  not  specify  output  values  but  calls  for  the  evaluation  of  some 
relation,  possibly  itself.  It  is  assumed  that  the  same  tree  structure  is  used  for  all 
evaluations  of  a  given  relation  and  that  an  unspecified  entry  is  not  replaced  by  a 
call  to  a  relation,  but  only  by  values. 

The  following  discussion  is  restricted  to  immediate  recursive  relations,  that  is, 
those  which  do  not  call  for  the  evaluation  of  any  other  relation  than  themselves. 
This  does  not  diminish  the  generality  of  the  development,  as  a  hierarchy  of 
several  different  relations  can  be  analyzed  in  parts  by  considering  each  relation 
separately  and  then  merging  the  results  using  the  probabilities  of  each  relation 
and  of  the  recursive  calls.  Such  an  analysis  is  demonstrated  in  Section  6  by  an 
example. 

Given  an  immediate  recursive  relation,  it  is  possible  under  the  assumptions  to 
compute  the  probability  e  that  an  evaluation  will  be  made  without  recursive  calls. 
If  e  is  1,  the  relation  is  not  recursive;  if  e  is  0,  then  the  relation  will  never  yield  a 
value  but  will  keep  issuing  recursive  calls  ad  infinitum. 

A first question about such relations concerns an upper bound on their testing cost. For nonrecursive relations, Corollary 1 sets such a bound at the sum of the testing costs of the variables, but recursive relations can evidently exceed it. The following proposition provides the answer.

ACM Transactions on Programming Languages and Systems, Vol. 2, No. 4, October 1980.

Activity of a Variable and Its Relation to Decision Trees

Proposition 1. Let R be an immediate recursive relation on n variables $x_1, \ldots, x_n$ with costs $c_1, \ldots, c_n$, and let e be as above; then the expected testing cost of R is no larger than $(1/e) \cdot \sum_{i=1}^{n} c_i$.

Proof. The probability of a recursive call occurring in any evaluation is 1 - e. At worst, an evaluation results in the test of all variables, for a cost of $\sum_{i=1}^{n} c_i$; thus the total cost is no larger than

$$\sum_{k=0}^{\infty} (1 - e)^k \cdot \sum_{i=1}^{n} c_i = \frac{1}{e} \cdot \sum_{i=1}^{n} c_i. \qquad \Box$$
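
As a rough numerical illustration of Proposition 1, the sketch below compares the closed-form bound with partial sums of the underlying geometric series. The function name `recursive_cost_bound` and the sample costs are hypothetical; only the formula (1/e) times the sum of the costs comes from the proposition.

```python
def recursive_cost_bound(costs, e, terms=None):
    """Upper bound of Proposition 1 on the expected testing cost.

    With terms=None, return the closed form (1/e) * sum(costs).
    Otherwise return the partial sum over the first `terms` invocations,
    the k-th of which is reached with probability at most (1 - e)**k.
    """
    if not 0 < e <= 1:
        raise ValueError("e must lie in (0, 1]")
    per_pass = sum(costs)  # worst case: every variable is tested in one pass
    if terms is None:
        return per_pass / e
    return sum((1 - e) ** k for k in range(terms)) * per_pass


# Hypothetical relation: three unit-cost variables, e = 0.25.
bound = recursive_cost_bound([1, 1, 1], 0.25)
partial = recursive_cost_bound([1, 1, 1], 0.25, terms=200)
print(bound, partial)
```

The partial sums approach the closed form from below, which is why the bound is finite whenever e > 0.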

A  decision  procedure  for  a  recursive  relation  is  an  infinite  tree  that  can  also  be 
represented  as  a  diagram  with  cycles,  each  cycle  leading  back  to  the  root  of  a 
subdiagram.  In  the  case  of  immediate  recursive  relations,  all  cycles  lead  back  to 
the  root  of  the  diagram.  We  can  compute  the  probability  that  the  relation  will 
take  on  a  specific  value  by  solving  a  simple  linear  equation,  subject  to  the 
convention  that  entries  for  which  several  values  are  specified  are  set  to  the 
specific  value  under  consideration  wherever  possible. 

The  notion  of  activity  of  a  variable  is  generalized  to  immediate  recursive 
relations  as  follows. 

(i) If an $m_i$-tuple does not include a recursive call, we count its contribution in the usual way.

(ii) If one or more recursive calls are included, the contribution is the probability of the $m_i$-tuple times the testing cost of the unspecified variable times the probability that the $m_i$-tuple will be mapped to more than one value.

We call this quantity the tree activity; the corresponding loss, the tree loss, is the testing cost minus the tree activity. The same quantities multiplied by 1/e will be referred to as diagram activity and diagram loss.

Theorem 2. Let R be an immediate recursive relation, and let a decision procedure for R be represented by a diagram D and an infinite tree T. The expected testing cost of the procedure, C(D) = C(T), is equal to

(i)  the  sum  of  the  diagram  activities  and  of  the  diagram  losses  taken  over  all 
internal  nodes  of  the  diagram,  or 

(ii)  the  sum,  taken  over  the  infinite  tree,  of  the  tree  activities  and  of  the  tree 
losses. 

Remark.  The  sum  of  the  diagram  activities  is  called  the  intrinsic  cost  of  the 
relation,  I(R). 

Proof. The proof relies on the original theorem for nonrecursive functions and on simple considerations on the series $1, 1 - e, (1 - e)^2, (1 - e)^3, \ldots$ and its sum, 1/e. If we replace all recursive calls in D by leaves, the cost of the resulting tree is the sum of the tree activities and the tree losses taken over all internal nodes of the tree. Introducing recursion results in a series of invocations, the probabilities of which are described by the series $(1 - e)^k$. □

Corollary  1  is  similarly  extended. 

Values (table of output values for R1 and R2; entries not legible in this copy)

Probabilities

    R1                                          R2
    x3 \ x1x2    00     01     11     10        y3 \ y1y2    00     01     11     10
    0            0.70   0.05   0.05   0.05      0            0.25   0.10   0.10   0.25
    1            0.05   0.01   0.01   0.04      1            0.10   0.05   0.05   0.10
    2            0.01   0.01   0.01   0.01

Costs

    c(x1) = 9, c(x2) = 9, c(x3) = 4.5, c(y1) = 50, c(y2) = 45, c(y3) = 36

Fig. 5. The relations for the example with their probabilities.


6.  AN  EXAMPLE 

As  mentioned  above,  the  results  can  be  extended  to  recursive  hierarchies  of 
relations,  subject  to  our  two  restrictions.  The  following  example  shows  how 
systems  of  relations  are  analyzed  part  by  part. 

Consider  a  situation  in  which  a  monitoring  program  must  periodically  evaluate 
several  system  variables.  If  the  sampled  values  point  to  a  satisfactory  status,  the 
program  waits  for  a  specific  period  of  time  and  examines  the  variables  again; 
otherwise,  either  a  malfunction  is  identified  and  the  program  takes  some  action 
and  stops,  or  further  analysis  is  required  and  some  additional  variables  are 
examined  to  determine  whether  the  program  should  resume  its  normal  cycle  or 
take  some  action  and  stop.  The  first  part  of  the  examination  (the  normal  cycle) 
is described by the relation R1, which includes calls both to itself and to the second relation R2 (the exception cycle), which includes calls to R1. In this example, R1 is a relation between $\{0, 1\}^2 \times \{0, 1, 2\}$ and the set of actions $\mathcal{R} = \{a, b\}$, and R2 is a relation between $\{0, 1\}^3$ and $\mathcal{R}$, as specified in Figure 5.

The analysis treats R1 and R2 separately and considers a structure from which all recursive calls have been eliminated. Once this structure has been analyzed by the methods developed above, the results are put together using p(R2), the probability that R2 is called from R1 in a given evaluation. Recursion is then taken into account by multiplying the results by 1/e, where e is the overall probability that no recursion will be needed.

We have p(R2) = 0.01 + 0.01 + 0.01 = 0.03; similarly, p(R1), the probability that R1 will be called in an evaluation of R2, is 0.25 + 0.25 = 0.5. The probability that no recursive call will be necessary is

$$e = 0.01 + 0.01 + 0.01 + p(R2) \cdot (0.1 + 0.1 + 0.1 + 0.05 + 0.05 + 0.1) = 0.045,$$

so that 1/e = 22.2. We can then compute the maximum probabilities of yielding a or b as

$$p(R1 = a) = [0.01 + 0.01 + p(R2) \cdot (0.1 + 0.1 + 0.1 + 0.05)] \cdot (1/e) = 0.67,$$
$$p(R1 = b) = [0.01 + 0.01 + p(R2) \cdot (0.1 + 0.05 + 0.05)] \cdot (1/e) = 0.57,$$
$$p(R2 = a) = 0.1 + 0.1 + 0.1 + 0.05 + p(R1) \cdot p(R1 = a) = 0.68,$$
$$p(R2 = b) = 0.1 + 0.05 + 0.05 + p(R1) \cdot p(R1 = b) = 0.48.$$
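
The probabilities above are easy to recompute. The following sketch simply transcribes the four formulas; the variable names are ours, and the comparison with the printed figures assumes that the text truncates to two decimals rather than rounding.

```python
# Probabilities taken from the text of Section 6.
p_R2 = 0.01 + 0.01 + 0.01   # R1 entries that call R2
p_R1 = 0.25 + 0.25          # R2 entries that call R1

# Probability that an evaluation ends without any recursive call.
e = 0.01 + 0.01 + 0.01 + p_R2 * (0.1 + 0.1 + 0.1 + 0.05 + 0.05 + 0.1)

# Maximum probabilities of yielding a or b.
p_R1_a = (0.01 + 0.01 + p_R2 * (0.1 + 0.1 + 0.1 + 0.05)) / e
p_R1_b = (0.01 + 0.01 + p_R2 * (0.1 + 0.05 + 0.05)) / e
p_R2_a = 0.1 + 0.1 + 0.1 + 0.05 + p_R1 * p_R1_a
p_R2_b = 0.1 + 0.05 + 0.05 + p_R1 * p_R1_b

print(e, 1 / e)
print(p_R1_a, p_R1_b, p_R2_a, p_R2_b)
```

The two probabilities for each relation sum to more than 1 because entries for which several values are specified are counted toward whichever value is under consideration.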

The tree activities are

$$a_{R1}(x_1) = 9 \cdot (0 + 0 + 0.02 \cdot p(R2 \neq b) + 0 + 0.02 \cdot p(R2 \neq a) + 0) = 0.148,$$
$$a_{R1}(x_2) = 9 \cdot (0 + 0.06 \cdot p(R1 \neq a) + 0.02 \cdot p(R2 \neq b) + 0 + 0.05 + 0) = 0.716,$$
$$a_{R1}(x_3) = 4.5 \cdot (0.76 \cdot p(R1 \neq b) + 0.07 + 0.07 + 0.1) = 2.524.$$

Similarly, we get $a_{R2}(y_1) = 10$, $a_{R2}(y_2) = 10.15$, and $a_{R2}(y_3) = 14.78$. Thus I, the intrinsic cost of the relations, is the sum of the tree activities of R1 and the tree activities of R2 (weighted by p(R2)) times 1/e:

$$I = [0.148 + 0.716 + 2.524 + p(R2) \cdot (10 + 10.15 + 14.78)] \cdot (1/e) = 98.575.$$

The upper bound on the cost is

$$C_{max} = [9 + 9 + 4.5 + p(R2) \cdot (50 + 45 + 36)] \cdot (1/e) = 587.3.$$
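
A short script reproduces these figures. It is a sketch under one assumption that the text does not spell out here: p(R ≠ v) is read as 1 - p(R = v), using the unrounded probabilities. With that reading the printed values are recovered to three decimals. The activities of the y variables are taken as quoted rather than recomputed.

```python
e = 0.045
p_R2 = 0.03
p_R1 = 0.5

# Unrounded maximum probabilities of yielding a or b.
p_R1_a = (0.02 + p_R2 * 0.35) / e
p_R1_b = (0.02 + p_R2 * 0.20) / e
p_R2_a = 0.35 + p_R1 * p_R1_a
p_R2_b = 0.20 + p_R1 * p_R1_b

# Tree activities of R1 (assumption: p(R != v) = 1 - p(R = v)).
a_x1 = 9 * (0.02 * (1 - p_R2_b) + 0.02 * (1 - p_R2_a))
a_x2 = 9 * (0.06 * (1 - p_R1_a) + 0.02 * (1 - p_R2_b) + 0.05)
a_x3 = 4.5 * (0.76 * (1 - p_R1_b) + 0.07 + 0.07 + 0.1)

# Tree activities of R2 as quoted in the text.
a_R2_total = 10 + 10.15 + 14.78

intrinsic_cost = (a_x1 + a_x2 + a_x3 + p_R2 * a_R2_total) / e
upper_bound = (9 + 9 + 4.5 + p_R2 * (50 + 45 + 36)) / e

print(round(a_x1, 3), round(a_x2, 3), round(a_x3, 3))
print(round(intrinsic_cost, 3), round(upper_bound, 1))
```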

Figure 6 describes a possible decision diagram D for the relations; the diagram losses and probabilities appear beside each internal node. The lower bound for the cost of this diagram is the sum of the intrinsic cost and of the diagram loss of $x_3$:

$$lb(D) = 98.575 + 43.91 = 142.486.$$

The cost of the diagram can be computed from Theorem 2(i):

$$C(D) = 98.575 + 1 \cdot 43.91 + 0.11 \cdot 73.93 + 0.04 \cdot 148.8 + 0.02 \cdot 137.7 + 0.02 \cdot 97.7 + 0.02 \cdot 200 + 3 \cdot (0.01 \cdot 471.5 + 0.007 \cdot 677.7 + 0.003 \cdot 370.370) = 197.$$

This can also be obtained by solving the diagram's cost equation:

$$C(D) = 1 \cdot 4.5 + 0.85 \cdot C(D) + 0.11 \cdot 9 + 0.04 \cdot 9 + 0.09 \cdot C(D) + 0.02 \cdot 9 + 0.02 \cdot 9 + 0.02 \cdot 9 + 3 \cdot (0.01 \cdot 36 + 0.007 \cdot 45 + 0.003 \cdot 50 + 0.005 \cdot C(D)),$$

yielding $(1 - 0.955) \cdot C(D) = 8.865$, so that C(D) = 8.865/0.045 = 197.
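
The cost equation is linear in C(D), so it can be solved by collecting the constant terms and the coefficients of C(D). The sketch below does exactly that; the helper name `solve_cost_equation` is hypothetical, and the coefficients are copied from the equation above.

```python
def solve_cost_equation(constant, recursive_weight):
    """Solve C = constant + recursive_weight * C for C."""
    if recursive_weight >= 1:
        raise ValueError("the procedure would never terminate")
    return constant / (1 - recursive_weight)


# Constant terms: probability of reaching a node times the cost of its test.
constant = (1 * 4.5 + 0.11 * 9 + 0.04 * 9 + 3 * (0.02 * 9)
            + 3 * (0.01 * 36 + 0.007 * 45 + 0.003 * 50))

# Total probability of the branches that lead back to the root.
recursive_weight = 0.85 + 0.09 + 3 * 0.005

cost = solve_cost_equation(constant, recursive_weight)
print(round(constant, 3), round(recursive_weight, 3), round(cost, 1))
```

Note that 1 minus the recursive weight equals e = 0.045, which is the factor 1/e of Theorem 2 appearing in another form.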

7.  CONSTRUCTING  DECISION  PROCEDURES 

The  construction  of  decision  procedures  with  minimal  expected  testing  costs  is, 
in  many  cases,  a  search  problem;  that  is,  no  algorithm  has  yet  been  devised  that 
does  not  exhibit  an  exponential  behavior  in  at  least  some  cases.  In  particular,  in 
the  case  of  binary  identification  [7],  the  problem  has  been  shown  to  be  NP- 
complete  [8].  This  leads  to  a  search  for  efficient  rules  for  constructing  suboptimal 
procedures. 

As noted earlier, the loss $l_f(x_i)$ is an approximate measure of the importance of not locating $x_i$ at the root of the subfunction f. Indeed, $l_f(x_i)$ satisfies all the requirements set forth in [6] for a selection criterion; that is,

(i) if a variable is necessary to distinguish all of the $m_i$-tuples it forms, then its activity is equal to its cost, so that its loss is null and it will be tested first (this is an optimal strategy, as can easily be shown [6]);

(ii) if a variable is never needed, that is, if the relation does not intrinsically depend on that variable, then its activity is null and its loss equal to its cost; this condition is easily detected and the variable discarded, unless all other variables have the same status and the relation still specifies at least two distinct values;

(iii) the loss is directly related to the number of $m_i$-tuples with equal components ("dash" entries in the decision tables discussed in [6]).

Fig. 7. (a) The tree constructed by the rule. (b) The optimal tree.

This leads to the following rule for local optimization, a generalization of the branch-and-bound criteria used in [4] and [15].

Rule. When developing the decision tree for the subfunction f, choose as the root the variable with the lowest loss, $l_f$. In case of a tie, choose the variable with the lowest cost. If the tie persists, choose any of the tied variables.
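
The rule amounts to a lexicographic comparison on (loss, cost). A minimal sketch follows; the function name `select_root` is hypothetical, and the costs in the usage example are inferred by us from loss = cost - activity using the figures of Example 3, not quoted from the text.

```python
def select_root(candidates):
    """Apply the selection rule.

    `candidates` maps a variable name to a (loss, cost) pair. The variable
    with the lowest loss wins; ties are broken by the lowest cost; if the
    tie persists, the first tied variable encountered is returned.
    """
    return min(candidates, key=lambda name: candidates[name])


# Losses quoted in the text; costs inferred as activity + loss (assumption).
example_1 = {
    "x1": (0.325, 0.175 + 0.325),
    "x2": (0.204, 0.476 + 0.204),
    "x3": (0.15, 0.1 + 0.15),
}
print(select_root(example_1))

# A tie on loss is resolved by the lower cost.
tied = {"u": (0.1, 2.0), "v": (0.1, 1.0)}
print(select_root(tied))
```

Applying the function again to each subfunction produced by the chosen root yields the greedy construction discussed in the remainder of this section.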

In the example of Section 6, the application of the above rule would result in the diagram of Figure 6, which is optimal in this case. However, the rule does not always result in optimal diagrams. In Example 1 we had $l_f(x_1) = 0.325$, $l_f(x_2) = 0.204$, $l_f(x_3) = 0.15$; thus $x_3$ would be chosen as the root. Continuing in this manner, we would get the tree of Figure 7a with an expected testing cost of 1.051, but the optimal tree is that shown in Figure 7b, with an expected testing cost of 1.0425. Thus the tree constructed by the rule is 1.0425/1.051 = 0.99 optimal. A conservative estimate can always be made by substituting the smallest lower bound (as obtained from Corollary 1) for the unknown minimal cost. In the above example, this yields an estimate of 0.901/1.051 = 0.86.

8.  DISCUSSION  OF  THE  SELECTION  RULE 

An  important  advantage  of  the  rule  is  its  simplicity;  compared  to  others  [3,  6, 14, 
17]  it  requires  a  minimum  of  computations.  It  is  also  more  general,  since  it  applies 
to  any  simple  recursive  or  nonrecursive  hierarchy  of  relations  with  costs  and 
probabilities. 
Moreover, the rule is optimal in several cases. As previously noted, it will
always  lead  to  the  selection  of  a  totally  necessary  variable  if  any  such  variable 
exists;  such  a  choice  was  seen  to  be  optimal.  We  also  have  the  following  result. 

Proposition  2.  For  any  recursive  relation  on  two  variables,  the  selection 
rule  constructs  optimal  diagrams. 

Proof.  Follows  immediately  from  the  fact  that  the  lower  bound,  as  computed 
from  Corollary  1,  is  the  exact  cost  of  the  diagram.  □ 

Our  previous  example  showed  that  this  result  does  not  hold  for  functions  of 
three  or  more  variables. 

A  more  important  question  is  how  bad  the  selection  can  be.  The  following 
example  illustrates  the  worst  case  for  completely  specified  Boolean  functions  with 
unity  costs. 

Let f be the Boolean function $f = x_1 + \bigoplus_{i=2}^{n} x_i$, where $\oplus$ denotes summation modulo 2, and assume the following probability distribution:

(i) Each point satisfying $x_1 \cdot \overline{\bigoplus_{i=2}^{n} x_i} = 1$ has probability $\gamma \cdot \epsilon$, for $\gamma < 1$, $\gamma \approx 1$.

(ii) Each point satisfying $x_1 = 0$ has probability $\epsilon$.

(iii) All other points have probability $\alpha = 2^{2-n} - (\gamma + 2) \cdot \epsilon$.

Then we get $a_f(x_1) = 2^{n-2} \cdot (\gamma + 1) \cdot \epsilon$ and $a_f(x_i) = 2^{n-1} \cdot \epsilon$ for $n \geq i \geq 2$, so that $l_f(x_i) < l_f(x_1)$. The two subfunctions resulting from the choice of some $x_i$, $i \neq 1$, as the root are again of the form $x_1 + \bigoplus x_j$, so that the trees constructed by the rule test $x_1$ last (on half the branches) and have cost $C(T_r) = n - 1 + 2^{n-2} \cdot (\gamma + 1) \cdot \epsilon$, while the optimal trees, rooted in $x_1$, have a cost of $C(T_o) = 1 + (n - 1) \cdot 2^{n-1} \cdot \epsilon$. (The case n = 4 is illustrated in Figure 8.) Thus, if $\epsilon \ll 1$ (e.g., if $\epsilon = 2^{-kn}$ for some k > 1), the asymptotic ratio of costs becomes $C(T_r)/C(T_o) = n - 1$.
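
The worst-case family can be checked by brute force for a small n. The sketch below is illustrative and rests on two assumptions of ours: the distribution is read as "points with x1 = 0 have probability ε, points with x1 = 1 and even parity of the remaining variables have probability γε, and all remaining points have probability α", and the activity of a unit-cost Boolean variable is taken to be the probability mass of the pairs of points that differ only in that variable and on which f changes. All function names are hypothetical.

```python
from itertools import product
from math import isclose


def f(point):
    """f = x1 OR (x2 XOR ... XOR xn); point[0] holds x1."""
    return point[0] | (sum(point[1:]) % 2)


def distribution(n, eps, gamma):
    """Point probabilities of the worst-case example (our reading)."""
    alpha = 2.0 ** (2 - n) - (gamma + 2) * eps
    probs = {}
    for point in product((0, 1), repeat=n):
        parity = sum(point[1:]) % 2
        if point[0] == 0:
            probs[point] = eps           # case (ii)
        elif parity == 0:
            probs[point] = gamma * eps   # case (i)
        else:
            probs[point] = alpha         # case (iii)
    return probs


def activity(i, probs):
    """Assumed definition for unit cost; i is a 0-based variable index."""
    total = 0.0
    for point, p in probs.items():
        flipped = point[:i] + (1 - point[i],) + point[i + 1:]
        if f(point) != f(flipped):
            total += p
    return total


def cost_rule_tree(probs, n):
    """Tests x2..xn first, then x1 only when their parity is 0."""
    return sum(p * (n - 1 + (sum(pt[1:]) % 2 == 0)) for pt, p in probs.items())


def cost_tree_rooted_in_x1(probs, n):
    """Tests x1 first, then x2..xn only when x1 = 0."""
    return sum(p * (1 + (n - 1) * (pt[0] == 0)) for pt, p in probs.items())


n, eps, gamma = 4, 2.0 ** -8, 0.9
probs = distribution(n, eps, gamma)
a1, a2 = activity(0, probs), activity(1, probs)
c_rule = cost_rule_tree(probs, n)
c_opt = cost_tree_rooted_in_x1(probs, n)
print(a1 < a2, c_rule / c_opt)
```

For these parameters the activity of x1 is the smallest, so the rule postpones x1, and the cost ratio is already close to n - 1 = 3; it moves closer as ε shrinks.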

By letting every point satisfying $\bar{x}_1 \cdot \bigoplus_{i=2}^{n} x_i = 1$ be mapped to a recursive call, we obtain the worst case for recursive Boolean functions. The best diagram, $D_o$, has a cost of $[1 + (n - 1) \cdot 2^{n-1} \cdot \epsilon]/(1 - 2^{n-2} \cdot \epsilon)$, while the rule-constructed diagram, $D_r$, has a cost of $n/(1 - 2^{n-2} \cdot \epsilon)$; thus the asymptotic ratio $C(D_r)/C(D_o)$ becomes approximately n for small $\epsilon$. That both recursive and nonrecursive cases yield the same worst case, O(n), is due to the fact that the recursive factor 1/e is independent of tree structure and is factored out.

The rule can construct arbitrarily bad trees; however, in the above example the lower bound on the cost of the trees is $lb(f) = 1 + (n - 2) \cdot 2^{n-1} \cdot \epsilon + 2^{n-2} \cdot (\gamma + 1) \cdot \epsilon$, so that $C(T_o) - lb(f) = 2^{n-2} \cdot (1 - \gamma) \cdot \epsilon \approx 0$. Therefore, we could have detected at an early stage that the trees constructed by the rule were costing much more than the original lower bound and revised the selection. This is not to say that the lower bound as obtained from Corollary 1 remains arbitrarily close to the cost of the optimal trees. It is easy to construct a binary identification problem [7] with n variables of unity cost and $2^{n-1}$ equally likely objects, so that the lower bound is always 1 while the optimal cost is n - 1; Figure 9 illustrates such a problem for three variables.

The  determination  of  a  general  upper  bound  for  the  worst  trees  constructed  by 
the  heuristic  rule,  as  well  as  for  the  lower  bound  obtained  from  Corollary  1,  is  an 
object  of  present  study. 

Fig. 9. An identification problem with three variables and four objects.


9.  CONCLUSION 

We  have  introduced  the  concept  of  the  activity  of  a  variable,  a  global  measure  of 
the  relevance  of  a  variable  in  determining  the  values  of  a  relation  on  discrete 
arguments.  We  have  proved  a  theorem  detailing  the  relationship  of  this  measure 
to  the  expected  testing  cost  of  the  relation.  Finally,  we  have  used  this  result  to 
develop  a  heuristic  procedure  for  the  fast  construction  of  suboptimal  decision 
diagrams  and  have  indicated  some  bounds  on  its  performance. 

The  applicability  of  these  results  to  recursive  functions  and  decision  diagrams 
with  cycles  should  provide  a  basis  for  further  developments  in  fault  analysis  by 
allowing  sequential  testing  of  time-related  processes,  as  well  as  by  supplying  a 
new  modeling  tool.  Other  areas  in  which  these  results  may  find  applications 
include  pattern  recognition,  database  theory,  and  switching  theory. 

Decision trees and diagrams (also known as sequential evaluation procedures) have widespread applications in databases, decision table programming, concrete complexity theory, switching theory, pattern recognition, and taxonomy; in short, wherever discrete functions must be evaluated sequentially. In this tutorial survey a common framework of definitions and notation is established, the contributions from the main fields of application are reviewed, recent results and extensions are presented, and areas of ongoing and future research are discussed.

INTRODUCTION

A decision tree or diagram is a model of the evaluation of a discrete function, wherein the value of a variable is determined and the next action (to choose another variable to evaluate or to output the value of the function) is chosen accordingly. Decision trees find many applications in decision table programming [Silb71, Pooc74, Metz77], databases [Wong76, Hana77], pattern recognition [Haus75, Bell78], taxonomy and identification [Jard71, Mors71, Gare72a, PayR80, Will80], machine diagnosis [Klet60, Chan70], switching theory [Lee59, Thay81a], and analysis of algorithms [Weid77]. More recently, they have been proposed as implementation-independent models of discrete functions with a view to the development of new testing methods [Aker79, More81a] and complexity measures [More80a].

Owing to this broad applicability, results about decision trees are dispersed throughout the literature in fields such as biology, computer science, information theory, and switching theory; moreover, there is no common notation or set of definitions. Therefore this article begins by establishing a framework of notation and definitions that introduce decision trees and diagrams and the various measures associated with them. Particular attention is paid to the problem of constructing decision trees and diagrams from function descriptions and evaluating their efficiency. The complexity of such constructions is detailed, and repercussions on circuit or program design analyzed. A survey of the main fields of application and related results follows. Recently proposed extensions (to include diagram composition and recursion) and applications (e.g., to system testing) are then discussed. The article concludes with an assessment of known results and suggestions for future research.

The emphasis throughout this exposition is on Boolean functions, since they find many more applications and are more readily understood than general discrete functions. The presentation alternates formal exposition, examples, and discussion; complex proofs are avoided (the reader will find them in the references), and the mathematical content is kept to the minimum necessary for clarity and conciseness. In particular, all necessary mathematical and other background is introduced in the first section so that the paper should be accessible to any reader with a mathematical or algorithmic bent. The intent is to cover the breadth of the field, unify terminology, convey the import of the main results, and act as a guide to the literature, for which last purpose a representative, rather than exhaustive, reference list is provided. As such, this survey should be of interest to both practitioners and researchers in the areas mentioned above.

1.  PRELIMINARIES 

Since the evaluation of Boolean functions, the programming of decision tables, and the identification of unknown objects (biological specimens, system faults, etc.) are among the most important applications of decision trees and diagrams, we provide a succinct review of the terminology and basic concepts of Boolean functions, decision tables, and identification problems. Readers who feel comfortable with these topics may wish to skip to Section 2.

1.1  Discrete  and  Boolean  Functions 

Only a very brief review is provided; for more details, the reader is referred to Davi80 on discrete functions and to Harr65 on Boolean functions.

By discrete function, we mean a (partial) function of discrete variables, $f(x_1, \ldots, x_n)$, where each variable, $x_i$, takes exactly $m_i$ values, which we choose to denote $0, \ldots, m_i - 1$. A discrete function is constant if and only if (iff) it assumes the same value wherever it is defined; it is null if it is not defined in any point of its domain, completely specified if it is defined everywhere. When a variable is evaluated, say $x_i = k$, we are left with the restriction, $f(x_1, \ldots, x_{i-1}, k, x_{i+1}, \ldots, x_n)$, which we denote $f|_{x_i=k}$. A variable, $x_i$, is redundant iff

$$f|_{x_i=0} = \cdots = f|_{x_i=m_i-1}, \qquad (1)$$

where two functions are equal if they have the same domain and codomain and assume the same value wherever they are both defined; a function without redundant variables is called intrinsic. Finally, a variable, $x_i$, is termed indispensable iff it is not redundant in any restriction resulting from the evaluation of any subset of the variables $\{x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n\}$. (This implies that a function can never be evaluated at any point without knowledge of the values of all indispensable variables.)
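
The definitions of restriction and redundancy translate directly into code. In the sketch below a partial discrete function is represented, by our own choice, as a dictionary from points to values, with undefined points simply absent; the helper names are hypothetical.

```python
from itertools import product


def restriction(f, i, k):
    """Return f restricted to x_i = k, keyed by the remaining coordinates.

    i is a 0-based index into each point tuple.
    """
    return {pt[:i] + pt[i + 1:]: v for pt, v in f.items() if pt[i] == k}


def is_redundant(f, i, m_i):
    """x_i is redundant iff all m_i restrictions agree wherever two are both defined."""
    parts = [restriction(f, i, k) for k in range(m_i)]
    for a in parts:
        for b in parts:
            if any(a[pt] != b[pt] for pt in a.keys() & b.keys()):
                return False
    return True


def is_intrinsic(f, sizes):
    """A function is intrinsic when none of its variables is redundant."""
    return not any(is_redundant(f, i, m) for i, m in enumerate(sizes))


# g(x1, x2) = x1 on {0, 1}^2: the second variable is redundant.
g = {pt: pt[0] for pt in product((0, 1), repeat=2)}
print(is_redundant(g, 0, 2), is_redundant(g, 1, 2), is_intrinsic(g, (2, 2)))
```

The pairwise comparison matters for partial functions, because agreement "wherever both are defined" is not transitive once undefined points are allowed.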

A Boolean function of n variables is a discrete function, $f: \{0, 1\}^n \to \{0, 1\}$, where $\{0, 1\}^n$ denotes the n-fold Cartesian product of $\{0, 1\}$, that is, the set of all binary n-tuples. Each n-tuple, $(x_1, \ldots, x_n)$, mapped to 1 by the function is a minterm of the function. A Boolean function can be specified by describing the mapping (giving its "truth table") or by listing its minterms and those points at which it is not defined (so-called "don't care" conditions); it can also be represented by a Boolean formula, usually in terms of the three operations of disjunction ($+$), conjunction ($\cdot$), and complementation ($\bar{x}$). A Boolean function of n variables can be expressed in terms of two functions of n - 1 variables by means of Shannon's expansion theorem

$$f(x_1, \ldots, x_n) = \bar{x}_i \cdot f|_{x_i=0} + x_i \cdot f|_{x_i=1} \qquad (2)$$

for each choice of $x_i$.

Table 1. Decision Table, Example 2

                              Rule 1   Rule 2   Rule 3   Rule 4
    Raining?                  Yes      No       No
    Wind condition                     Breezy   Calm     Windy
    Clean basement            X                          X
    Spade garden                                X
    Fly kite with children             X

Example 1

Consider the Boolean function of three variables, $f(x_1, x_2, x_3)$, given by the mapping

    f:  (0,0,0) -> 0    (1,0,0) -> 0
        (0,0,1) -> 0    (1,0,1) -> 0
        (0,1,0) -> 1    (1,1,0) -> 0
        (0,1,1) -> 1    (1,1,1) -> 1

Since every point in the domain is assigned a value, the function is completely specified. Other representations for f are the list of its minterms

    {(0,1,0), (0,1,1), (1,1,1)}

or a Boolean formula

$$\bar{x}_1 x_2 \bar{x}_3 + \bar{x}_1 x_2 x_3 + x_1 x_2 x_3.$$

The latter formula is equivalent to the list of minterms; it can be simplified to yield the minimum expression

$$\bar{x}_1 x_2 + x_2 x_3.$$

It is easily verified that the function is intrinsic; expanding it around $x_2$ yields

$$f = \bar{x}_2 \cdot 0 + x_2 \cdot (\bar{x}_1 + x_3). \qquad \Box$$

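
For Example 1, the equivalence of the three representations (minterm list, minimum expression, and Shannon expansion around x2) can be confirmed by exhaustive evaluation. The sketch below is only a check; the function names are ours.

```python
from itertools import product

MINTERMS = {(0, 1, 0), (0, 1, 1), (1, 1, 1)}


def f(x1, x2, x3):
    """The function of Example 1, defined by its list of minterms."""
    return int((x1, x2, x3) in MINTERMS)


def minimal(x1, x2, x3):
    """Minimum expression: (not x1) x2 + x2 x3."""
    return ((1 - x1) & x2) | (x2 & x3)


def expanded(x1, x2, x3):
    """Shannon expansion around x2: (not x2) * 0 + x2 * ((not x1) + x3)."""
    return ((1 - x2) & 0) | (x2 & ((1 - x1) | x3))


points = list(product((0, 1), repeat=3))
agree = all(f(*p) == minimal(*p) == expanded(*p) for p in points)
print(agree)
```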
1.2  Decision  Tables 

The terminology used in the following is that of Metz77; other general references are Silb71 and Pooc74.

A decision table is an organizational or programming tool for the representation of discrete functions. It can be viewed as a matrix where the upper rows specify sets of conditions and the lower ones sets of actions to be taken when the corresponding conditions are satisfied; thus each column, called a rule, describes a procedure of the type "if conditions, then actions."

Example  2 

Table  1  describes  how  to  spend  a  Saturday 
afternoon  in  spring.  It  has  two  condition 
rows,  three  action  rows,  and  four  rules;  the 
fust  condition  is  a  binary  variable  (taking 
values  “yes”  or  “no”),  while  the  second  is  a 
ternary  variable  (taking  values  “calm,” 
“breezy,”  or  "windy”).  According  to  normal 
practice  [Metz77],  condition  and  action 
names  are  used  as  labels  on  appropriate 
rows  and  a  rule  is  specified  by  entering 
values  in  the  condition  rows  (or  blanks,  for 
don’t  care  conditions)  and  X’s  (meaning 
“execute”)  in  the  action  rows.  The  four 
rules  can  be  read  as 

“if it is raining, then clean the basement”;

“if it is breezy and not raining, then fly kite with children”;

“if it is calm and not raining, then spade the garden”;

“if it is windy, then clean the basement.” □
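
The reading of a decision table as a set of "if conditions, then actions" rules can be sketched in a few lines of Python. This is an illustrative encoding only: the dictionary layout and the function names are assumptions, not notation from Metz77 or from the report. A condition absent from a rule plays the role of a blank (don't care) entry.

```python
# Each rule is (conditions, actions). A condition missing from the dict is a
# "don't care" (blank) entry. The rule order follows the four rules above.
RULES = [
    ({"raining": "yes"}, {"clean basement"}),
    ({"raining": "no", "wind": "breezy"}, {"fly kite with children"}),
    ({"raining": "no", "wind": "calm"}, {"spade garden"}),
    ({"wind": "windy"}, {"clean basement"}),
]

def applicable_rules(rules, situation):
    """Return the 1-based indices of all rules whose conditions hold."""
    return [
        k for k, (conds, _) in enumerate(rules, start=1)
        if all(situation[name] == value for name, value in conds.items())
    ]

def actions_for(rules, situation):
    """Union of the action sets of every applicable rule."""
    selected = set()
    for k in applicable_rules(rules, situation):
        selected |= rules[k - 1][1]
    return selected
```

For instance, `actions_for(RULES, {"raining": "no", "wind": "calm"})` selects only the third rule, while a rainy and windy afternoon makes both the first and the fourth rules applicable.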

A pair of rules overlaps if a combination of condition values can be
found that satisfies the condition sets of both rules. If two
overlapping rules specify different actions, they are called
inconsistent and the table is said to be ambiguous; if they specify
identical actions, they are termed redundant.

Table 2. Decision Table, Example 3

                            Rule 1   Rule 2   Rule 3   Rule 4
  Raining?                  Yes      No       No
  Calm?                              (No)     Yes      (No)
  Breezy?                            Yes      (No)     (No)
  ------------------------------------------------------------
  Clean basement            X                          X
  Spade garden                                X
  Fly kite with children             X

Decision tables described so far are in so-called extended-entry form.
Often, however, it is required that all conditions be Boolean
variables; this gives rise to limited-entry decision tables. Although
most such tables are set up in limited format from their conception, it
may be necessary to convert extended-entry tables to limited-entry
format; this is done by using one Boolean variable for each value (but
one) of the multivalued variable to be replaced [Pres65]. This process
results in tables where entries in one condition row often imply
(absent) entries in others; such implied entries can also be present in
any decision table and give rise to apparent (but nonexistent)
ambiguity. Since the implications result from purely semantic
considerations, they cannot be detected by an automatic processor, so
that they must be explicitly specified. The impossible combinations of
conditions will then be treated as inputs with unspecified mapping.

Example  3 

In Table 1, Rules 1 and 4 overlap because they are both applicable when
it is raining and windy. Since they specify the same action set, they
are redundant, and since no other rules overlap, the table is
unambiguous. Table 2 is the same table, converted to limited-entry
format. It still has three action rows and four rules, but now has
three condition rows. Implied entries are shown in parentheses; their
absence, while not confusing to a human, would induce an automatic
processor to decide that the second and third rules are inconsistent,
since both could apparently apply when it is not raining and it is calm
and breezy. Note that the specification of implied entries in the table
is insufficient: while it identifies the impossible condition set
(no, yes, yes), it fails to identify the equally impossible set
(yes, yes, yes), which will be erroneously included in the first rule.
This suggests that logical inconsistencies be separately listed
[King73]; for instance, the above table would be supplemented by the
logical expression NOT (breezy AND calm). □
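
The check that an automatic processor would perform can be sketched as follows. This Python fragment is an illustration under stated assumptions, not an algorithm taken from King73 or Pres65: the rules are those of the limited-entry table with the implied entries of the second and third rules left out, the windy rule is encoded as "neither calm nor breezy", and the separately listed logical expression is passed in as a predicate named `impossible` (a hypothetical parameter).

```python
from itertools import combinations, product

DOMAINS = {"raining": ("yes", "no"), "calm": ("yes", "no"), "breezy": ("yes", "no")}

# Limited-entry rules without the implied entries of rules 2 and 3.
RULES = [
    ({"raining": "yes"}, {"clean basement"}),
    ({"raining": "no", "breezy": "yes"}, {"fly kite"}),
    ({"raining": "no", "calm": "yes"}, {"spade garden"}),
    ({"calm": "no", "breezy": "no"}, {"clean basement"}),
]

def matches(conds, point):
    """True if every specified condition of a rule holds at the given point."""
    return all(point[name] == value for name, value in conds.items())

def classify_overlaps(rules, domains, impossible=lambda point: False):
    """Map each overlapping pair of rules to 'redundant' or 'inconsistent'."""
    names = list(domains)
    report = {}
    for (i, (ci, ai)), (j, (cj, aj)) in combinations(enumerate(rules, 1), 2):
        for values in product(*(domains[n] for n in names)):
            point = dict(zip(names, values))
            if impossible(point):
                continue          # semantically excluded combination
            if matches(ci, point) and matches(cj, point):
                report[(i, j)] = "redundant" if ai == aj else "inconsistent"
                break
    return report

# Without semantic knowledge, rules 2 and 3 look inconsistent.
naive = classify_overlaps(RULES, DOMAINS)
# Supplying NOT (breezy AND calm) removes the apparent ambiguity.
checked = classify_overlaps(
    RULES, DOMAINS,
    impossible=lambda p: p["calm"] == "yes" and p["breezy"] == "yes",
)
```

Passing the impossible combinations as a predicate mirrors the suggestion that logical inconsistencies be listed separately from the table itself.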

It should now be clear that an unambiguous extended-entry decision
table is a special case of a partial function of multivalued variables,
where the conditions correspond to the variables and the action sets to
the function values. In particular, a complete decision table (one
which has an applicable rule for every combination of conditions)
corresponds to a completely specified function, and a limited-entry
decision table corresponds to a function of binary variables. An
ambiguous decision table can be modeled by a relation, as discussed
later. A sequential evaluation procedure for a decision table is then
of particular importance, since it corresponds to an implementation,
usually in software, of the decision table; indeed, the importance of
the limited-entry format is in good part due to the ease of programming
binary decisions (by if-then-else constructs) [Metz77].

1.3 Identification Problems

Consider  the  situation  where  an  unknown 
event  or  specimen  is  to  be  classified  into 
one  of  a  finite  number  of  known  categories, 
based  upon  the  outcome  of  a  number  of 
tests.  (This  is  a  special  case  of  the  concept 
of questionnaires [Pica72].) Such identification
problems arise in biology, medical
diagnosis,  machine  trouble  shooting,  and 
numerous  pattern  recognition  applications. 
A  binary  identification  problem  includes 
only  binary  tests. 

As defined in Gare72a, a binary identification problem consists of a
finite set of objects, $\{O_1, \ldots, O_n\}$, which represents the
universe of possible identifications, and a finite set of tests,
$\{T_1, \ldots, T_m\}$, each of which is a function from the set of
objects to the set {yes, no} (i.e., each test applied to a specific
object gives a specific yes/no answer). In a simple binary
identification problem, all tests are of the form “is the unknown
object of type i?”; that is, their outcome is no for all but one object
[Gare72b]. An optional probability distribution can be specified on the
set of objects, giving the a priori probability that an unknown object
will be identified as each object in the set. In practical situations,
the number of tests and the number of objects are often in the same
range (even though, in theory, the number of tests could be arbitrarily
larger than that of objects and the number of objects could grow as an
exponential function of the number of tests).

In our terminology, the tests are binary variables and the objects are
values of a partial bijective function from the variables’ space to the
set of objects. In biology, tests are also known as characters or
conditions and objects as taxa or formae, while the specification of an
identification problem is known as a diagnostic table; in pattern
recognition, tests would be known as features and objects as classes;
in questionnaire theory, the tests are questions and their outcomes are
responses.

Example  4 

Consider the identification problem given by a set of five objects,
{a, b, c, d, e}, and a set of four tests, each written as the set of
objects for which it answers yes: {$T_1$ = {a}, $T_2$ = {a, b},
$T_3$ = {a, b, c}, $T_4$ = {a, b, c, d}}. The same problem can be
represented as a diagnostic table (see Table 3), in which the rows
correspond to objects and the columns to tests. In our terminology, we
have a partial bijective function of four binary variables, defined in
five points and given by the following mapping:

    (yes, yes, yes, yes) → a
    (no,  yes, yes, yes) → b
    (no,  no,  yes, yes) → c
    (no,  no,  no,  yes) → d
    (no,  no,  no,  no)  → e   □
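
The passage from the set description of the tests to the diagnostic table, and from there to the partial function, can be made concrete with a short Python sketch. The data structures and function names below are illustrative assumptions; only the objects, the tests, and the resulting mapping come from the example.

```python
OBJECTS = ["a", "b", "c", "d", "e"]

# Each test is given, as in the example, by the set of objects answering "yes".
TESTS = {
    "T1": {"a"},
    "T2": {"a", "b"},
    "T3": {"a", "b", "c"},
    "T4": {"a", "b", "c", "d"},
}

def diagnostic_table(objects, tests):
    """Map each object to its tuple of yes/no answers, one per test."""
    return {
        obj: tuple("yes" if obj in yes_set else "no" for yes_set in tests.values())
        for obj in objects
    }

def as_partial_function(table):
    """Invert the table: answer tuple -> object. Fails if two objects collide."""
    mapping = {}
    for obj, answers in table.items():
        if answers in mapping:
            raise ValueError(f"tests cannot separate {mapping[answers]} and {obj}")
        mapping[answers] = obj
    return mapping

TABLE = diagnostic_table(OBJECTS, TESTS)
MAPPING = as_partial_function(TABLE)
```

The mapping is defined on only five of the sixteen points of the variables' space, which is what makes the function partial; the collision check is what makes it bijective onto the set of objects.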


Table 3. Diagnostic Table, Example 4

         T_1    T_2    T_3    T_4
    a    Yes    Yes    Yes    Yes
    b    No     Yes    Yes    Yes
    c    No     No     Yes    Yes
    d    No     No     No     Yes
    e    No     No     No     No

This formulation provides a clean, but very much simplified, model of
the general identification problem. In pattern recognition
applications, the test results are generally not as clearly defined;
instead, each test may take any of the possible values, as specified by
a discrete probability function. The same problem arises in biology
(where it is known as probabilistic identification); however, since the
probability distribution is often unknown, a confidence threshold is
normally used, which dichotomizes test outcomes as “known” (taking a
specific value) or “variable” (susceptible of taking any value). In the
following, we shall first look at Garey’s model, then generalize
results to include the probabilistic identification model.

2.  DEFINITIONS 

2.1  Decision  Trees  and  Diagrams 

A decision tree (or diagnostic key, as it is known in many
identification applications) can be regarded as a deterministic
algorithm for deciding which variable to test next, based on the
previously tested variables and the results of their evaluation, until
the function’s value can be determined. Several constraints are placed
upon such an algorithm; all are designed to prevent clearly redundant
testing. The following formal definition of a decision tree is taken
from More80b.

Definition  1 

Let $f(x_1, \ldots, x_n)$ be a (partial) function of discrete
variables. If f is constant or null, then the decision tree for f is
composed of a single leaf labeled by the constant value or by the null
symbol. Otherwise, for each $x_i$, $1 \le i \le n$, such that at least
two restrictions, say $f|_{x_i=k_1}$ and $f|_{x_i=k_2}$, are not null,
f has one or more decision trees composed of a root labeled $x_i$ and
$m_i$ subtrees, which are decision trees corresponding to the
restrictions $f|_{x_i=0}, \ldots, f|_{x_i=m_i-1}$, in that order. □

This recursive definition closely parallels the conventional definition
of ordered trees, such as binary trees [Knut73]; it defines decision
trees as rooted, ordered, vertex-labeled trees, where each node has
either $m_i$ children for some i, $1 \le i \le n$, or none (and is a
leaf). To an extent, this definition prevents redundant testing: a
variable is tested only once on any path; no variable can be tested
which would result in all restrictions but one being null; and no more
testing can take place as soon as the function has been reduced to a
constant.

The evaluation of a discrete function represented as a decision tree
starts by ascertaining the value of the variable associated with the
root of the tree. It then proceeds by repeating the process on the kth
subtree, where k is the value assumed by the root variable, until a
leaf is reached; the label of the leaf gives the value of the function.

Example  5 

The Boolean function of Example 1 was given by the formula

    $f(x_1, x_2, x_3) = \bar{x}_1 x_2 + x_2 x_3$.

A possible decision tree for that function is shown in Figure 1. Since
decision trees are ordered, the left subtree of a node corresponds to
the node’s variable evaluating at 0, the right subtree to the variable
evaluating at 1. Thus the left subtree of the root corresponds to the
restriction

    $f|_{x_1=0} = x_2$,

and the right subtree corresponds to the restriction

    $f|_{x_1=1} = x_2 x_3$.

Evaluation on the tree for the triple of values (1, 0, 0) starts by
examining $x_1$; on finding it to be 1, it proceeds to the right
subtree, there to evaluate $x_3$. Since $x_3$ is found to be 0, the
left subtree is next used, thereby encountering a leaf and terminating
the evaluation, having used only two of the three variables; the label
of the leaf is the value of the function, that is, f(1, 0, 0) = 0. Note
that a decision tree for a Boolean function is an explicit illustration
of Shannon’s expansion; the tree of Figure 1 represents the expansion

    $f(x_1, x_2, x_3) = \bar{x}_1 \cdot [\bar{x}_2 \cdot 0 + x_2 \cdot 1]
        + x_1 \cdot [\bar{x}_3 \cdot 0 + x_3 \cdot (\bar{x}_2 \cdot 0 + x_2 \cdot 1)]$.

□
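
The evaluation procedure can be sketched in Python as follows. The nested-tuple encoding and the name `evaluate` are assumptions made for illustration, and the tree is transcribed from the textual description of this example (the figure itself is not reproduced here).

```python
# A leaf is an int (the function value); an internal node is a tuple
# (variable_index, subtree_for_0, subtree_for_1), with variables numbered from 1.
X2_NODE = (2, 0, 1)
FIGURE_1 = (1, X2_NODE, (3, 0, X2_NODE))   # root x1; x3 then x2 on the right

def evaluate(tree, values):
    """Return (function value, list of variables tested along the path)."""
    tested = []
    while not isinstance(tree, int):
        var, *subtrees = tree
        tested.append(var)
        tree = subtrees[values[var - 1]]   # descend into the k-th subtree
    return tree, tested
```

Calling `evaluate(FIGURE_1, (1, 0, 0))` follows the path described in the example: it tests x1, then x3, and stops at a leaf without ever looking at x2.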

We note that the same subtree may occur on several branches of the
tree, in which case it may be desirable to use only one copy of that
subtree by transforming the decision tree (through a process known as
reticulation [PayR77]) into a simple decision diagram, which has the
structure of a rooted, directed, acyclic (hyper)graph. In the case of
Boolean functions, further requiring that there be only one leaf
labeled 1 (the “finish” node) yields a free Boolean graph. Of course,
we can choose which identical subtrees to merge, if any; in particular,
every decision tree is a (simple) decision diagram. To every simple
decision diagram there corresponds a unique decision tree; moreover,
the paths in the diagram are in one-to-one correspondence with those in
the tree. Conversely, to every decision tree for a completely specified
function there corresponds a unique “minimal” diagram, that is, one in
which every possible merge has been accomplished. Reticulation
sometimes also merges nonidentical subtrees while preserving the
identity of the function; in such cases, it may happen that a variable
occurs more than once along a path from the root to a leaf. Such
nonsimple decision diagrams may further decrease the total number of
nodes required; however, the second test of a variable is redundant and
thus detracts from the diagram’s “efficiency.”

Decision diagrams (and their corresponding trees) can easily be
programmed. Lee59 calls the result (simple) decision programs and has
suggested a universal instruction type which implements the evaluation
process taking place at an internal node:

    $L:\ i,\ g_0, \ldots, g_{m_i-1}$,

where L is a label, i identifies variable $x_i$, and $g_k$, used only
when $x_i = k$, is either a value (if the restriction for $x_i = k$ is
a constant) or a label. Such an instruction is executed by testing
variable $x_i$ and, upon finding its value, say $x_i = k$, taking the
corresponding action, $g_k$, that is, either transferring control to
the instruction labeled $g_k$ or assigning to the function the value
$g_k$. Thus to each node of the diagram there corresponds one
instruction in the program. Cerny [Cern79b] has investigated a
special-purpose architecture for the execution of such programs.

Figure 2. The minimal free Boolean graph of Example 6.

Decision trees and diagrams for Boolean functions find yet another
implementation, this time in hardware as multiplexer trees and networks
[Cern79a, Thay81a]. In a multiplexer tree (network), each internal tree
(diagram) node is represented as a 2-1 multiplexer controlled by the
node variable and each leaf is implemented as a constant logical value
(wired at 0 or wired at 1); the interconnection scheme is that of the
decision tree (diagram). The evaluation of a function then proceeds
from the “leaves” (the constant values) to the “root” multiplexer; the
function variables, used as control variables, select a unique path
from the root to one leaf, and the value assigned to that leaf
propagates along the path to the output of the “root” multiplexer.

Example  6 

Consider the tree of Example 5. It is itself a diagram; merging the
identical subtrees rooted in nodes labeled $x_2$ and the identical
leaves results in the minimal free Boolean graph pictured in Figure 2.
This diagram can be implemented by the decision program (where letters
are used for labels to distinguish them from values)

    Start:  1, A, B
    A:      2, 0, 1
    B:      3, 0, A

Figure 3 depicts an equivalent multiplexer network. We note that, as a
consequence of our definitions, the number of multiplexers used in a
network is precisely the number of instructions of an equivalent
decision program; similarly, the maximum delay through a network is
proportional to the maximum execution time of an equivalent program,
both being dependent upon the length of the longest path through the
diagram. □

Figure 3. The multiplexer network of Example 6.
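
A decision program of this kind can be executed by a very small interpreter. The Python sketch below is an illustration only: the dictionary encoding and the function `run` are assumptions, not Lee's notation, although the three instructions are exactly those listed in the example.

```python
# Decision program of the example: label -> (variable index, g_0, g_1).
# Strings are labels, integers are function values.
PROGRAM = {
    "Start": (1, "A", "B"),
    "A": (2, 0, 1),
    "B": (3, 0, "A"),
}

def run(program, values, label="Start"):
    """Execute instructions until a value is assigned.

    Returns (function value, number of instructions executed)."""
    steps = 0
    while True:
        var, *targets = program[label]
        steps += 1
        g = targets[values[var - 1]]       # action selected by the value of x_var
        if isinstance(g, int):
            return g, steps                # g is a value: assign it and stop
        label = g                          # g is a label: transfer control
```

The instruction count returned by `run` is the length of the path followed through the diagram, so its maximum over all inputs is the quantity to which the worst-case execution time is proportional.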

2.2  Measures  on  Decision  Trees 
and  Diagrams 

Since the root of each subtree can be labeled with any (up to the
restrictions of Definition 1) of the untested variables, the number of
possible decision trees for a given function is in general very large
(and that of possible decision diagrams even larger). For instance, the
function of Example 5 has ten distinct decision trees, as shown in
Figure 4. In fact, a completely specified Boolean function of n
variables can have up to

    $N_T(n) = \prod_{i=0}^{n-1} (n - i)^{2^i}$    (3)

distinct decision trees. Indeed, n choices are possible for the root,
followed by n - 1 choices on each of the two subtrees, or $(n-1)^2$
choices; in general, up to $(n-k)^{2^k}$ choices are possible at depth
k. This corresponds to the recurrence relation

    $N_T(n) = n \cdot (N_T(n-1))^2$,

which shows that $N_T(n)$ grows doubly exponentially with n. The first
few values of $N_T(n)$ are listed in Table 4.

Table 4. First values of $N_T(n)$

     n   N_T(n)            n   N_T(n)
     1   1                 6   1.65 · 10^13
     2   2                 7   1.91 · 10^27
     3   12                8   2.91 · 10^55
     4   576               9   7.64 · 10^111
     5   1658880          10   5.84 · 10^224
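
The tabulated values follow directly from the recurrence and from the product form of Equation (3). A short Python sketch (the function names are illustrative) computes both with exact integer arithmetic:

```python
def count_trees(n):
    """Upper bound N_T(n) from the recurrence N_T(n) = n * N_T(n-1)**2."""
    count = 1                      # N_T(1) = 1: one variable, one tree
    for m in range(2, n + 1):
        count = m * count ** 2
    return count

def count_trees_product(n):
    """Same bound from the product of (n - i)**(2**i) for i = 0..n-1."""
    result = 1
    for i in range(n):
        result *= (n - i) ** (2 ** i)
    return result
```

Because Python integers are unbounded, the sketch also reproduces the larger entries exactly; `f"{count_trees(10):.2e}"` can be used to compare them with the rounded values in the table.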

Not all tree or diagram representations of a function are equally
desirable. Thus several criteria have been developed in order to select
an appropriate representation; such criteria attempt to measure
important properties of decision trees and diagrams such as their
implementation and usage costs.

In the most general case, each variable has an associated testing cost,
which measures the expense (e.g., in time) incurred each time that
variable is evaluated, and a storage cost, which measures the expense
(e.g., in memory) due to the presence of each test node labeled by that
variable. In addition, a probability distribution is often specified on
the variables’ space, which can be assumed uniform if not otherwise
known. These data allow the computation of the following measures.
(These and other criteria are discussed in depth in More81b.)

Definition  2 

(i) The worst case testing cost, h, is the maximum path testing cost.
    When all testing costs are unity, h reduces to the worst case
    number of tests, that is, the height of the tree or diagram.

(ii) The expected testing cost, E, is the expected value of the path
    testing cost, where the probability of a path is the sum of the
    probabilities of all the combinations of variables’ values that
    select that path. When all testing costs are unity, E reduces to
    the expected number of tests.

(iii) The tree storage cost, $\alpha$, is the sum of the storage costs
    of the internal nodes of the tree. When all costs are unity,
    $\alpha$ reduces to the number of internal nodes of the tree.

(iv) The diagram storage cost, $\beta$, is the sum of the storage costs
    of the internal nodes of the minimal diagram. □


In the case of unity costs and uniform probability distribution, the
only datum needed to compute the first three measures (those applicable
to trees) is the number of leaves at each level of the tree. Thus a
decision tree for a function of n variables can be characterized by an
(n + 1)-tuple, the leaf profile [More80a],
$(\lambda_0, \ldots, \lambda_n)$, where $\lambda_i$ is the number of
leaves at depth i. The leaf profile induces a lexicographic ordering on
decision trees, thereby giving rise to two additional measures: (1) the
maximum profile, which ranks as “best” that tree which is largest in
lexicographic order (on the grounds that leaves should be encountered
as soon as possible), and (2) the minimum reverse profile, which ranks
as “best” that tree which is smallest in reverse lexicographic order
(on the grounds that the number of long paths should be minimized).

Example  7 


Assume storage costs s, testing costs t, and probability distribution
p, for the function of Example 5:

    s:  x_1 → 1    x_2 → 2    x_3 → 3

    t:  x_1 → 1    x_2 → 2    x_3 → 6

    p:  (0,0,0) → 0.10    (1,0,0) → 0.05
        (0,0,1) → 0.15    (1,0,1) → 0.05
        (0,1,0) → 0.05    (1,1,0) → 0.25
        (0,1,1) → 0.20    (1,1,1) → 0.15

Figure 5 shows the tree of Figure 1 with its node probabilities. The
expected testing cost, E, of the tree is

    E = (0.25 + 0.25) · (1 + 2) + 0.3 · (1 + 6)
        + (0.05 + 0.15) · (1 + 6 + 2)
      = 5.4.

The tree’s profile is (0, 0, 3, 2), and its various measures are

    Measure                                  Value
    h (worst case testing cost)              9
    Height (worst case number of tests)      3
    E (expected testing cost)                5.4
    Expected number of tests                 2.2
    $\alpha$ (storage cost)                  8
    Node count                               4      □

Figure 6. Two possible decision trees for the problem of Example 8.
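
The measures of Definition 2 can be recomputed mechanically from the data of this example. The following Python sketch is illustrative: the tuple encoding of the tree and the helper names are assumptions, and the tree is the one described in Example 5 (root x1, then x2 on the left branch, x3 followed by x2 on the right branch).

```python
from itertools import product

S = {1: 1, 2: 2, 3: 3}            # storage costs s
T = {1: 1, 2: 2, 3: 6}            # testing costs t
P = {(0, 0, 0): 0.10, (1, 0, 0): 0.05, (0, 0, 1): 0.15, (1, 0, 1): 0.05,
     (0, 1, 0): 0.05, (1, 1, 0): 0.25, (0, 1, 1): 0.20, (1, 1, 1): 0.15}

# Leaf = int; internal node = (variable, subtree_for_0, subtree_for_1).
TREE = (1, (2, 0, 1), (3, 0, (2, 0, 1)))

def path(tree, point):
    """Variables tested when evaluating the tree at the given point."""
    tested = []
    while not isinstance(tree, int):
        var, low, high = tree
        tested.append(var)
        tree = high if point[var - 1] else low
    return tested

def internal_nodes(tree):
    """Variables labelling the internal nodes, one entry per node."""
    if isinstance(tree, int):
        return []
    var, low, high = tree
    return [var] + internal_nodes(low) + internal_nodes(high)

points = list(product((0, 1), repeat=3))
h = max(sum(T[v] for v in path(TREE, p)) for p in points)
height = max(len(path(TREE, p)) for p in points)
E = sum(P[p] * sum(T[v] for v in path(TREE, p)) for p in points)
expected_tests = sum(P[p] * len(path(TREE, p)) for p in points)
alpha = sum(S[v] for v in internal_nodes(TREE))
node_count = len(internal_nodes(TREE))
```

Summing over points rather than over paths gives the same expectation, since the probability of a path is by definition the sum of the probabilities of the points that select it.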


Finally, when ascertaining the value of a variable requires a costly
apparatus (or subroutine), it may be desirable to minimize the total
cost incurred through the acquisition of such apparatus for each
variable used in the tree; we call this criterion the total acquisition
cost. Minimizing this cost is a common problem in biological
identification [PayR80, Will80], where the number of tests often
exceeds the number of taxa; the objective is to find the smallest
subset of tests that still separates all taxa, which corresponds to
minimizing the total acquisition cost when all tests have unity
acquisition costs. Clearly, such a cost is fixed (and maximal) for
intrinsic functions, since all variables must be tested, and thus
evaluated at some point or other in the tree. In fact, this cost is
better associated with the functions rather than the trees.

2.3  Binary  Identification 

In the simplified model of binary identification expounded by Gare72a,
there is exactly one combination of test values associated with each
object. Therefore decision trees for such problems have a fixed number
of leaves (one per object) and thus of internal nodes (since the number
of internal nodes of a binary tree is one less than the number of its
leaves), so that their storage cost is simply equal to one less than
the number of objects when all costs are unity. Moreover, since no two
leaves are identical, there can be no common subtrees, so that decision
diagrams for identification problems are decision trees.

In the even simpler case of simple binary identification, a yes answer
immediately identifies the object and so terminates the evaluation, so
that at most one path (that corresponding to the “no” answer) leads
from a test to another; this corresponds to a degenerate tree with a
number of internal nodes equal to its height. Only one optimization
criterion, the expected testing cost, is applicable, and the
optimization can be done step by step using a simple ordering of tests
in terms of their cost to probability ratio [John56, Ries63, Slag64,
Gare72b]. Similar conditions arise when a Boolean function is evaluated
in a linear (as opposed to tree-structured) sequence [Hana77].
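
The ordering rule can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not the procedure of any of the cited papers: each test asks "is the unknown object of type i?", tests are asked in a fixed sequence until one answers yes, every test in the sequence (including the last) is charged when reached, and the costs and probabilities are hypothetical values chosen for the illustration.

```python
from itertools import permutations

def expected_cost(order, cost, prob):
    """Expected cost of asking the tests in 'order' until one answers yes."""
    total, remaining = 0.0, 1.0
    for obj in order:
        total += cost[obj] * remaining   # test is run only if still unidentified
        remaining -= prob[obj]
    return total

def ratio_order(cost, prob):
    """Order tests by increasing cost-to-probability ratio."""
    return sorted(cost, key=lambda obj: cost[obj] / prob[obj])

COST = {"a": 4.0, "b": 1.0, "c": 2.0, "d": 3.0}      # hypothetical values
PROB = {"a": 0.4, "b": 0.1, "c": 0.3, "d": 0.2}      # hypothetical values

best_by_ratio = expected_cost(ratio_order(COST, PROB), COST, PROB)
best_by_search = min(expected_cost(o, COST, PROB) for o in permutations(COST))
```

Under this model, exchanging two adjacent tests changes the expected cost only through the term in which each test's cost multiplies the other's probability, which is why comparing the ratios is sufficient; the exhaustive search over permutations is included only as a check on small instances.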

Example  8 

Two possible decision trees for the binary identification problem of
Example 4 are illustrated in Figure 6. Both trees have a storage cost
of 4; their other measures are

    Tree     Leaf profile       Height    Expected number of tests
    Left     (0, 0, 3, 2, 0)    3         2.4
    Right    (0, 1, 1, 1, 2)    4         2.8      □
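
The figures in this example can be recovered from the leaf profiles alone. The sketch below assumes that the five objects are equally likely, an assumption that is not stated in the example but is consistent with its values; the function name is illustrative.

```python
from fractions import Fraction

def profile_measures(profile):
    """Height, internal-node count, and expected number of tests of a binary
    identification tree, given its leaf profile and equally likely objects."""
    leaves = sum(profile)
    height = max(depth for depth, count in enumerate(profile) if count)
    expected = Fraction(sum(d * c for d, c in enumerate(profile)), leaves)
    # A binary tree has one internal node fewer than it has leaves.
    return height, leaves - 1, expected

LEFT = (0, 0, 3, 2, 0)
RIGHT = (0, 1, 1, 1, 2)
```

Exact fractions are used so that 12/5 and 14/5 can be compared directly with the tabulated 2.4 and 2.8 without rounding concerns.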

3.  OPTIMIZATION 

In most applications, decision trees and diagrams must be constructed
from function descriptions. Since, as previously observed, numerous
tree representations can be built, with varying usage costs, we
naturally strive to construct a decision tree which optimizes a
suitable measure. This endeavor raises several questions.

(i) Since the choice of an optimization criterion can be difficult,
    can a tree be constructed which simultaneously optimizes several
    measures?

(ii) How difficult is the optimization task for each criterion?

(iii) If constructing an optimal tree is too time consuming, are there
    fast heuristic methods that build acceptable suboptimal trees? If
    so, how good are those methods?

In this section, each question is answered in turn and a survey of
optimization methods provided.

3.1  Questions  of  Compatibility 

In order to answer the first question, we examine the relationships
between measures. We shall say that two optimization criteria are
compatible if, for every function in a given family, at least one tree
can be constructed that satisfies both criteria. Moret [More81b] has
shown that, even if we restrict our attention to the family of Boolean
functions with uniform costs and probabilities (the case that is least
conducive to incompatibilities), all criteria are pairwise
incompatible, with two exceptions: (1) the minimum height is a special
case of the minimum reverse profile, and (2) the exact relationship of
the number of diagram nodes with the number of tree nodes is as yet
unknown. (However, it is easily shown that the minimization of one does
not necessarily result in the minimization of the other.) In the more
general case of discrete functions with nonuniform costs and
probabilities, all criteria are pairwise incompatible [More81b].

Thus it appears that the six measures defined on decision trees and
diagrams are essentially independent, so that we are indeed faced with
a problem of choice. In order to gather more information about the
possible choices, we now address the second question.

3.2  Questions  of  Complexity 

The problem of constructing optimal decision trees and diagrams has
been addressed by many researchers using branch-and-bound techniques
[ReiL66, ReiL67, Brei75b] and dynamic programming [Gare72a, Misr72,
Meis73, Baye73, Schu76, PayH77, Mart78]. In the following, we review
those and more specialized techniques. Each method is first introduced
and its salient characteristics mentioned; a more detailed explanation
follows, which the less mathematically inclined reader may wish to
skip.

3.2.1 The Dynamic Programming Method

Dynamic programming is of particular interest as all of our tree
measures obey the “principle of optimality,” that is, they are such
that an optimal solution can be built from optimal subsolutions. This
is the case because the building of a decision subtree for each
restriction is a separate problem that can be optimally solved
independently of the others. This also tells us that our sixth measure,
the diagram cost, does not obey the principle of optimality, since
subdiagrams often overlap; the resulting interaction destroys the
independence of the subproblems.

The general algorithm builds the optimal tree from the leaves up by
identifying successively larger optimal subtrees (one for each
combination of tested and untested variables). This approach is
embodied in an algorithm designed to convert limited-entry decision
tables to decision trees with minimal expected testing cost [Baye73].
This solution was independently discovered by Schu76, who generalized
it to extended-entry decision tables; a refined version, using some
game tree heuristics, was recently published [Mart78]. A closely
related procedure was developed for use in pattern recognition to
minimize the worst case or the expected testing cost of binary decision
trees [Meis73, PayH77]. The earliest version of the algorithm appears
to be due to Gare70 (see also Gare72a, Misr72) in the context of binary
identification problems.

For a function of n k-ary variables, the algorithm requires a number of
operations proportional to $n \cdot k \cdot (k+1)^{n-1}$. Since a
complete specification of the function requires $S = k^n$ items of
information, the algorithm takes $O(S^{\log_k(k+1)} \cdot \log S)$ time
for completely specified k-ary functions and is thus fairly efficient.
For partial functions, in particular identification problems, however,
the input size may be much smaller; a binary identification problem
with m tests and n objects requires the specification of m items each
of size n (the answer to each test for each object) for an input size
of $S = m \cdot n$. In this case, the algorithm may require time
exponential in the input size, thereby being very inefficient. In fact,
binary identification problems appear to be intrinsically “hard,” that
is, there is considerable evidence that no algorithm can be developed
for them that would not require exponential time.¹

We now examine in more detail how the dynamic programming method is
applied to the optimization of decision trees and provide an example;
the less mathematically inclined reader may wish to skip to the
beginning of the next section.

If variable $x_i$, with testing cost $t_i$ and storage cost $s_i$, is
tested at the root of a decision tree for the function
$f(x_1, \ldots, x_n)$, then the optimal values for the first three
measures of Definition 2 are

    $h_{\min}(f) = t_i + \max\{h_{\min}(f|_{x_i=0}), \ldots, h_{\min}(f|_{x_i=m_i-1})\}$,

    $E_{\min}(f) = t_i + \sum_{j=0}^{m_i-1} p(x_i = j) \cdot E_{\min}(f|_{x_i=j})$,

    $\alpha_{\min}(f) = s_i + \sum_{j=0}^{m_i-1} \alpha_{\min}(f|_{x_i=j})$,    (4)

where $p(x_i = j)$ denotes the probability that $x_i$ takes on the
value j. Similarly, the leaf profile of this tree is obtained by
summing, component by component, the leaf profiles of its subtrees,
then introducing an additional first component, set to 0. In a decision
diagram, however, the subdiagrams representing the various restrictions
usually overlap, so that the storage cost of a function is not directly
related to the storage costs of its restrictions. This shows that five
of our six measures indeed obey the principle of optimality.

1 Technically, they are NP-hard [Gare79] problems, as proved in Hyaf76 and Love79;
for a detailed discussion of the exact complexity, the reader is referred to
More81b.

The algorithm will generate all possible restrictions of the function. For a
function of $k$-ary variables, each variable can be in any of $(k + 1)$ conditions
($k$ values and the untested state) so that there are $(k + 1)^n$ distinct
combinations; generating them from the bottom $k^n$ leaves (all variables tested)
to the unique top node (all variables untested) requires a number of steps equal to

\[
\sum_{i=0}^{n} (n - i) \cdot \binom{n}{i} \cdot k^{n-i} = n \cdot k \cdot (k + 1)^{n-1},
\tag{5}
\]

since a node with $i$ untested variables has $n - i$ possible parents (each with
one more untested variable) and can be chosen in $\binom{n}{i} \cdot k^{n-i}$ ways.
Therefore, using the "big Oh" notation of algorithm analysis [Weid77], the
algorithm takes $O(n \cdot k \cdot (k + 1)^{n-1})$ time.

A restriction with $i$ untested variables (that is, with $n - i$ tested variables)
determines a subspace of $k^i$ points, which we call an $i$-subcube. The algorithm
starts by considering all 0-subcubes (that is, all points in the variables' space),
then forms all possible 1-subcubes by merging $k$ 0-subcubes, in effect letting one
variable be undetermined (so that a unique variable is associated with each
merging). This process continues, forming all $i$-subcubes by merging
$(i - 1)$-subcubes, until the final $n$-subcube (the complete space) is formed.
Each subcube is identified by an $n$-tuple of values, $(i_1, \ldots, i_n)$, where
$i_j$ is X if the $j$th variable is untested at that node, and is the variable's
value otherwise. For instance, the 0-subcubes identified by $(0, 1, \ldots, 1)$,
$(1, 1, \ldots, 1)$, ..., $(k - 1, 1, \ldots, 1)$ can be merged into the 1-subcube
given by $(X, 1, \ldots, 1)$ by letting variable $x_1$ be untested. Thus the process
builds a lattice of $(k + 1)^n$ nodes with $n \cdot k \cdot (k + 1)^{n-1}$ edges. It
is noted that the same $i$-subcube can be formed by merging one of $i$ distinct
$k$-tuples of $(i - 1)$-subcubes.
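The node and edge counts quoted above can be checked by brute force for small cases. The following is only an illustrative sketch: the names `lattice_counts` and `merge_children` and the string marker `"X"` are assumptions made for this example, not part of the original algorithm description.

```python
from itertools import product

UNTESTED = "X"  # marker for an untested variable, as in the n-tuples of the text


def lattice_counts(n, k):
    """Enumerate every subcube of an n-variable, k-ary space.

    Returns (number of nodes, number of edges), where each edge links a
    subcube to a parent obtained by un-testing one of its tested variables.
    """
    symbols = list(range(k)) + [UNTESTED]
    nodes = list(product(symbols, repeat=n))
    # A node with n - i tested positions has exactly n - i parents.
    edges = sum(sum(1 for s in node if s != UNTESTED) for node in nodes)
    return len(nodes), edges


def merge_children(node, k):
    """For each untested position j, list the k subcubes merged through variable j."""
    return {
        j: [node[:j] + (v,) + node[j + 1:] for v in range(k)]
        for j, s in enumerate(node)
        if s == UNTESTED
    }


if __name__ == "__main__":
    n, k = 3, 2
    nodes, edges = lattice_counts(n, k)
    print(nodes, (k + 1) ** n)               # both are the node count
    print(edges, n * k * (k + 1) ** (n - 1))  # both are the edge count of (5)
```

An $i$-subcube has $i$ entries in the dictionary returned by `merge_children`, which mirrors the remark that it can be formed from one of $i$ distinct $k$-tuples of $(i - 1)$-subcubes.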

As the lattice is built, each node (i.e., each subcube) is assigned a cost and a
value; if we are interested in the expected testing cost, the probability of each
node is also computed. The probability of an $i$-subcube is just the sum of the
probabilities of the $k$ merged $(i - 1)$-subcubes; the value of the function for
an $i$-subcube is that of the $k$ merged $(i - 1)$-subcubes, if their function
values were identical, and is "?" (a special symbol indicating that the value is a
nonconstant function) otherwise. One easily verifies that these quantities are well
defined, that is, independent of the choice of the merged subcubes. Finally, the
cost assigned to an $i$-subcube is the minimum cost of merging, where the merging
of $k$ $(i - 1)$-subcubes has cost 0 if all $k$ subcubes have identical known
values, and is given by the appropriate equation of (4) otherwise. Initial
conditions are given by 0-subcubes with values and probabilities given by the
problem. The cost of the top node (the $n$-subcube) is just the cost of the optimal
tree for the function; the tree itself can be recovered by walking down the
lattice, choosing at each step to test the variable that gave rise to the least
costly merging, until nodes with a known value (leaves) are reached.

Example  9 

We shall make use of the function of Example 7 to show how the dynamic programming
algorithm produces a decision tree with minimal expected testing cost. With 3
variables, the variables' space has $2^3 = 8$ points, each identified by a unique
combination of the 3 variables. The algorithm will build a lattice of $3^n$ nodes
with $n \cdot 3^{n-1}$ pairs of edges, as shown in Figure 7. Since we are
interested in the expected testing cost, we keep track at each node of the
function's value, the node's probability, and the merging costs, where the cost of
merging two $(i - 1)$-subcubes is 0 if both subcubes have identical known values,
and is equal to

\[ c \cdot (p_1 + p_2) + c_1 + c_2 \]

otherwise, where $c_1, p_1$ ($c_2, p_2$) are the cost and probability of the first
(second) $(i - 1)$-subcube, respectively, and $c$ is the testing cost of the
variable used in merging. Since the cost of the least expensive merging which
produces the top node is 5.05, that is the cost of the optimal tree for our
problem. The large arrows in Figure 7 show which merging was least expensive at
each step and allow the recovery of the optimal decision tree, in this case, the
linear testing sequence $x_2, x_1, x_3$ (the fifth tree in Figure 4). □
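The lattice computation can be sketched as a memoized recursion over subcubes, using the merging cost $c \cdot (p_1 + p_2) + c_1 + c_2$ given above. Everything named here is an assumption made for illustration: the function $f = x_2(\bar{x}_1 + x_3)$ and the point probabilities are not copied from Example 7 (which lies outside this passage) but are inferred from the figures quoted in Examples 10 to 12, and the split of probability 0.25 between the points (0,0,0) and (0,0,1) cannot be recovered from those figures and is chosen arbitrarily.

```python
from functools import lru_cache

X = None  # marker for an untested variable


def optimal_tree_cost(f, prob, cost, n, k=2):
    """Return (value, probability, cost, best_variable) for the top node.

    f and prob map each point (an n-tuple) to its function value and
    probability; cost[j] is the testing cost of variable j (0-based).
    """

    @lru_cache(maxsize=None)
    def solve(node):
        free = [j for j, s in enumerate(node) if s is X]
        if not free:  # a 0-subcube: a single point of the variables' space
            return f[node], prob[node], 0.0, None
        best = None
        for j in free:
            kids = [solve(node[:j] + (v,) + node[j + 1:]) for v in range(k)]
            p = sum(kid[1] for kid in kids)
            values = {kid[0] for kid in kids}
            if "?" not in values and len(values) == 1:
                return values.pop(), p, 0.0, None  # identical known values
            c = cost[j] * p + sum(kid[2] for kid in kids)
            if best is None or c < best[2]:
                best = ("?", p, c, j)
        return best

    return solve((X,) * n)


# Assumed data (see the lead-in): costs 1, 2, 6 and an inferred distribution.
COSTS = (1.0, 2.0, 6.0)
PROB = {
    (0, 0, 0): 0.125, (0, 0, 1): 0.125,  # arbitrary split of 0.25
    (0, 1, 0): 0.05, (0, 1, 1): 0.20,
    (1, 0, 0): 0.05, (1, 0, 1): 0.05,
    (1, 1, 0): 0.25, (1, 1, 1): 0.15,
}
F = {pt: int(pt[1] == 1 and (pt[0] == 0 or pt[2] == 1)) for pt in PROB}

VALUE, TOP_PROB, TOP_COST, ROOT = optimal_tree_cost(F, PROB, COSTS, n=3)

if __name__ == "__main__":
    print(round(TOP_COST, 2), "root variable: x%d" % (ROOT + 1))
```

Under these assumptions the top node's cost agrees with the 5.05 quoted in the example, with $x_2$ at the root. The recursion visits each subcube once, so its work grows with the size of the lattice rather than with the number of trees.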

3.2.2  The  Branch-and-Bound  Method 

Our sixth measure, the diagram storage cost, is not amenable to a solution by
dynamic programming, since it supposes identification of common subtrees and thus a
global (as opposed to dynamic programming's local) view of the subproblems. Thus
optimization of diagram storage cost is done by search techniques, principally
branch-and-bound [Rein67], a method which has also been applied to the
optimization of the expected testing cost [Rein66, Brei75b].

Recall that a branch-and-bound algorithm proceeds by always developing that partial
solution which is potentially less expensive than any other (as determined by a
lower bound function), often switching from one partial solution to another when
lower bounds change, until one solution has been completely developed [Lawl66]. The
lower bound function used for the diagram and tree storage costs [Rein67] is based
upon this simple fact: a nonredundant variable must appear at least once in any
diagram. Thus a rough lower bound can be derived by simply summing the storage
costs of all the nonredundant variables. The lower bound used for the expected
testing cost can be derived in an analogous fashion by considering the a priori
probability that each variable will be tested in any decision tree representation
and modifying it to reflect the influence of the choice of a root [Rein66, Brei75b,
More80b], or it can be derived from first principles as a complexity measure on
decision trees [More80a].

The  principal  disadvantage  of  the 
branch-and-bound  method  is  that  it  may 
result  in  a  near-exhaustive  search  of  the 
possible  trees  and  diagrams,  a  process  that, 
in  view  of  the  dimension  of  the  search  space 
(as  previously  discussed),  leads  to  intolera¬ 
bly  long  computations.  In  terms  of  algo¬ 
rithm  analysis,  the  branch-and-bound  tech¬ 
nique  is  an  exponential-time  algorithm,  re¬ 
gardless  of  the  input  size. 

We  now  examine  in  more  detail  the 
bounding  functions  mentioned  above  and 
illustrate  the  use  of  branch-and-bound 
methods  by  a  simple  example.  Again,  the 
less  mathematically  inclined  reader  may 
wish  to  skip  to  the  next  section. 

A lower bound on the storage cost of a decision tree must incorporate the influence
of the choice of a given variable to be of use in the branch-and-bound process. To
achieve this end, the lower bound is modified by using a one-level "look-ahead"; if
variable $x_i$ is chosen for the root of the tree representing function $f$, then
the lower bound is the sum of

(i) the storage cost of the chosen variable, $x_i$;

(ii) the storage cost of each $x_j$, $j \ne i$, times the number of restrictions,
$f|_{x_i=k}$, for which $x_j$ is nonredundant.

For diagrams, the multiplicative factor in (ii) is modified to take into account
the fact that $x_j$ may play exactly the same role for some restrictions, that is,
that for every combination of values of the remaining variables, either all the
restrictions are equal or at most one is not constant.

The lower bound on the expected testing cost of a decision tree can be derived from
first principles by considering the development of a measure of the influence of a
variable on the expected testing cost of decision tree representations. Any such
measure should possess the following two properties:

(i) the measure is minimal (equal to zero) when the variable is redundant and
maximal (equal to the variable's testing cost) when the variable is indispensable;

(ii) the measure is compatible with the tree structure, that is, if we denote such
a measure for the variable $x_i$ by $a_f(x_i)$, it must be the case that, for each
$j \ne i$,

\[ a_f(x_i) = \sum_{k=0}^{m_j-1} p(x_j = k) \cdot a_{f|_{x_j=k}}(x_i). \]

Moret [More80a] has shown that only one measure satisfies those two conditions: the
activity of a variable, which is equal to the testing cost of the variable times
the a priori probability that it will be tested (a concept related to the Boolean
difference used in Boolean algebra [Thay81b]). The a priori probability that
variable $x_i$ will be tested is just the probability that, with all its other
variables evaluated, the function still depends on $x_i$; this is easily computed
in linear time. The lower bound used in Rein66 and Brei75b can then be defined as
the sum of the testing cost of the root variable and the activities of the
remaining variables [More80b].

Figure 8. The partial trees as developed by the branch-and-bound method for the
function of Example 10: (a) after the first stage; (b) after the second stage.

Example  10 

Consider again the Boolean function of our previous examples. The activities of the
three variables are found to be

\[
\begin{aligned}
a_f(x_1) &= 1 \cdot \mathrm{prob}(x_2 = 1 \text{ and } x_3 = 0) = 0.3, \\
a_f(x_2) &= 2 \cdot \mathrm{prob}(x_3 = 1 \text{ or } x_1 = 0) = 1.4, \\
a_f(x_3) &= 6 \cdot \mathrm{prob}(x_1 = 1 \text{ and } x_2 = 1) = 2.4.
\end{aligned}
\]

Now the lower bound on the expected testing cost of a decision tree for $f$ with
root $x_i$, $lb_f(x_i)$, can be computed for each variable:

\[
\begin{aligned}
lb_f(x_1) &= t(x_1) + a_f(x_2) + a_f(x_3) = 4.8, \\
lb_f(x_2) &= t(x_2) + a_f(x_1) + a_f(x_3) = 4.7, \\
lb_f(x_3) &= t(x_3) + a_f(x_1) + a_f(x_2) = 7.7.
\end{aligned}
\]

Thus the branch-and-bound algorithm chooses to develop the tree rooted in $x_2$.
The left subtree is then a leaf labeled 0 so that only two possible partial trees
arise, depending on the choice of the variable tested at the root of the right
subtree. Lower bounds are computed in turn for these. We now have four partial
subtrees, pictured with their lower bounds in Figure 8a. The algorithm will choose
to develop the first partial tree (rooted in $x_1$), since it is now the least
expensive; this yields four partial trees for a total of seven partial trees,
pictured with their lower bounds in Figure 8b. Now the algorithm will return to the
fifth tree (rooted in $x_2$) and complete it; since its final cost, 5.05, is lower
than the bound on any partial tree, the completed tree is optimal. □
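The activities and root lower bounds of this example can be recomputed directly from the definition (testing cost times the probability that the function still depends on the variable once all others are known). The sketch below rests on assumptions: the function $f = x_2(\bar{x}_1 + x_3)$ and the point probabilities are inferred from the numbers quoted in Examples 10 and 12 rather than taken from Example 7, and the split of 0.25 between the points (0,0,0) and (0,0,1) is arbitrary. The helper names are hypothetical.

```python
def activity(f, prob, cost, i, k=2):
    """Testing cost of variable i times the probability that f depends on it.

    A point contributes its probability when changing coordinate i alone
    (all other variables held fixed) can change the value of f.
    """
    mass = 0.0
    for pt, p in prob.items():
        values = {f[pt[:i] + (v,) + pt[i + 1:]] for v in range(k)}
        if len(values) > 1:
            mass += p
    return cost[i] * mass


def root_lower_bound(f, prob, cost, i, k=2):
    """Testing cost of the root plus the activities of the other variables."""
    others = (j for j in range(len(cost)) if j != i)
    return cost[i] + sum(activity(f, prob, cost, j, k) for j in others)


# Assumed data (see the lead-in).
COSTS = (1.0, 2.0, 6.0)
PROB = {
    (0, 0, 0): 0.125, (0, 0, 1): 0.125,  # arbitrary split of 0.25
    (0, 1, 0): 0.05, (0, 1, 1): 0.20,
    (1, 0, 0): 0.05, (1, 0, 1): 0.05,
    (1, 1, 0): 0.25, (1, 1, 1): 0.15,
}
F = {pt: int(pt[1] == 1 and (pt[0] == 0 or pt[2] == 1)) for pt in PROB}

ACTIVITIES = [activity(F, PROB, COSTS, i) for i in range(3)]
BOUNDS = [root_lower_bound(F, PROB, COSTS, i) for i in range(3)]

if __name__ == "__main__":
    print([round(a, 2) for a in ACTIVITIES])
    print([round(b, 2) for b in BOUNDS])
```

With these inputs the smallest bound belongs to $x_2$, which is the root the branch-and-bound search develops first in the example.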

3.2.3  Other  Methods 

In the context of logic (particularly Boolean) functions, specialized methods have
been devised which attempt to use some of the standard tools of logic (such as
reduction to canonical or minimal formulas and decompositions) in order to
construct decision trees and diagrams with minimal storage or expected testing cost
[Mich78, Thay78, Cern79a, Thay81a].

The minimization of storage or expected testing cost for decision trees has been
approached for Boolean [Cern79a] and multivalued [Thay78] logic functions by
considering a special class of subfunctions, which the authors call T-terms.
Starting with the terms of order 0, which are just logic cubes (in the sense of
switching theory [Harr65]), terms of successively higher order are constructed by
consensus operations. (That is, for each variable, $x_i$, one takes the consensus
of a T-term of order $n$ and one of order $k \le n$ with respect to $x_i$; the
result is a T-term of order $n + 1$ if it is not already a T-term of lower order
and if it is independent of $x_i$.) Only the prime terms are kept in the
construction process (where a prime term is a term not contained in any other term
of the same order). A simple procedure is then used to derive a tree optimal with
respect to worst-case testing cost, expected testing cost, or storage cost.
Although the algorithm sheds light on the relationship between Boolean formulas and
optimal decision trees, it is not of practical interest (except for minimizing
diagram storage cost) since it contains a hard problem; the prime terms of order 0
are just the prime implicants of the function [Harr65] and obtaining them, even
from complete function descriptions, is known to be NP-hard [Mase82]. As a result,
the procedures proposed may require exponential time under any input size. A
similar drawback is present in the algorithm published by Mich78, which converts
extended-entry decision tables to decision trees with minimal storage or expected
testing cost, since the algorithm starts by establishing a minimal disjoint set
cover for the function, a process known to be NP-hard [Gare79].

An extension of T-terms was recently proposed by Thay81a for the design of decision
trees and diagrams with minimal storage cost. This formulation is based on a class
of functions called P-functions, where a P-function for the Boolean function $f$ is
a pair, $\langle g, h \rangle$, of functions such that $f$ and $h$ are equal when
restricted to the points where $g$ evaluates to 1. A composition operation is
defined that allows the building of a lattice of P-functions, from the lowest order
(with $h = 1$ or $h = 0$) to the highest order (with $g = 1$). Again, only prime
P-functions are retained in building the lattice. A search procedure generates
optimal decision diagrams from the lattice of prime P-functions. The generation of
optimal decision trees, however, requires that prime P-functions be replaced by
prime P-cubes, that is, prime P-functions restricted so that they are logic cubes.
Since the lowest order P-cubes comprise the prime implicants of the function and
its complement, we find anew the NP-hard subproblem, so that the synthesis of
optimal decision trees by P-functions requires exponential time and is therefore of
little practical interest. (It must be noted that we do not imply that finding
prime implicants is an unsurmountable task; in fact, several algorithms for that
purpose have been studied and shown to do well in practice [Slag70, Hulm75]. Our
point is that the methods described above, which incorporate this NP-hard problem
as only a small part of the complete work, cannot compare with the dynamic
programming method.) On the other hand, the construction of optimal diagrams, while
exponential (no analysis is provided with the algorithm, but the generation of all
prime P-functions may clearly require that much time), remains of interest since no
substantially better solution is known.

We  now  proceed  to  a  closer  examination 
of  the  composition  of  P-functions,  followed 
by  an  example.  Once  again,  the  less  math¬ 
ematically  inclined  reader  may  wish  to  skip 
to  the  next  section. 

If $f$ has $n$ variables, it has up to $2^{2^n} - 1$ P-functions on which a lattice
structure can be established, from the lowest to the highest order, by means of the
following noncommutative composition law (designed to be compatible with Shannon's
decomposition). A P-function of order $k + 1$, $\langle g, h \rangle$, is obtained
from two P-functions of lower order (one of order $k$ and the other of order no
larger than $k$), $\langle g_0, h_0 \rangle$ and $\langle g_1, h_1 \rangle$, by
using, for each $x_i$, the formula

\[
\langle g, h \rangle = \langle g_0|_{x_i=0} \cdot g_1|_{x_i=1},\ \bar{x}_i \cdot h_0 + x_i \cdot h_1 \rangle,
\tag{6}
\]

which we denote $\langle g_0, h_0 \rangle \circ_i \langle g_1, h_1 \rangle$. In a
sense, the resulting P-function of order $k + 1$ corresponds to our state of
knowledge prior to testing variable $x_i$. Indeed, substituting $x_i = 0$ into
$\langle g, h \rangle$ yields

\[ \langle g_0|_{x_i=0} \cdot g_1|_{x_i=1},\ h_0 \rangle, \]

and substituting $x_i = 1$ yields

\[ \langle g_0|_{x_i=0} \cdot g_1|_{x_i=1},\ h_1 \rangle, \]

which shows how the second function in the pair reflects our increased knowledge
about the function $f$ (until that second function is a constant, meaning that the
evaluation of $f$ is complete), while the first function provides us with
information about the path of evaluation followed so far.
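Composition law (6) is easy to experiment with when Boolean functions are stored as truth tables. The sketch below is an illustration under stated assumptions: the helper names (`table`, `restrict`, `compose`, `is_p_function`) are hypothetical, variables are indexed from 0, and the sample function $f = x_2(\bar{x}_1 + x_3)$ is inferred from the surrounding examples rather than quoted from the original.

```python
from itertools import product

N = 3
POINTS = list(product((0, 1), repeat=N))


def table(fn):
    """Truth table of fn, as a tuple aligned with POINTS."""
    return tuple(int(bool(fn(*pt))) for pt in POINTS)


def restrict(tab, i, v):
    """Table of the restriction x_i = v, still indexed by all N variables."""
    lookup = dict(zip(POINTS, tab))
    return tuple(lookup[pt[:i] + (v,) + pt[i + 1:]] for pt in POINTS)


def compose(p0, p1, i):
    """<g0, h0> o_i <g1, h1> following formula (6)."""
    (g0, h0), (g1, h1) = p0, p1
    g = tuple(a & b for a, b in zip(restrict(g0, i, 0), restrict(g1, i, 1)))
    # h = (not x_i) * h0 + x_i * h1, evaluated point by point
    h = tuple(b if pt[i] else a for pt, a, b in zip(POINTS, h0, h1))
    return g, h


def is_p_function(pair, f):
    """Check that f and h agree wherever g evaluates to 1."""
    g, h = pair
    return all(fv == hv for gv, fv, hv in zip(g, f, h) if gv)


# Assumed sample function: f = x2 * (not x1 + x3).
F = table(lambda x1, x2, x3: x2 and (not x1 or x3))
ONE = table(lambda *_: 1)
ZERO = table(lambda *_: 0)
NOT_F = tuple(1 - v for v in F)

A0, A1 = (F, ONE), (NOT_F, ZERO)   # the two P-functions of order 0
B0 = compose(A0, A1, 0)            # composition with respect to x1
B2 = compose(A1, A0, 2)            # composition with respect to x3
C0 = compose(A0, B2, 0)
D0 = compose(A1, C0, 1)            # top of the lattice: <1, f>
```

Each composed pair can be passed to `is_p_function` to confirm that the law preserves the defining property, and `D0` can be compared with `(ONE, F)` to see the top of the lattice emerge.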

Example 11

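Consider the Boolean function of our previous examples. The two prime P-functions
of order 0 are

\[ A_0 = \langle f, 1 \rangle \quad \text{and} \quad A_1 = \langle \bar{f}, 0 \rangle. \]

From those two functions, we can form six P-functions of order 1, three of which
are prime:

\[
\begin{aligned}
B_0 &= A_0 \circ_1 A_1 = \langle x_2 \bar{x}_3,\ \bar{x}_1 \rangle, \\
B_1 &= A_1 \circ_2 A_0 = \langle \bar{x}_1 + x_3,\ x_2 \rangle, \\
B_2 &= A_1 \circ_3 A_0 = \langle x_1 x_2,\ x_3 \rangle.
\end{aligned}
\]

Using now the five prime P-functions of order 0 and 1, we can form 54 P-functions
of order 2 (not all distinct), four of which are prime:

\[
\begin{aligned}
C_0 &= A_0 \circ_1 B_2 = B_0 \circ_3 A_0 = \langle x_2,\ \bar{x}_1 + x_3 \rangle, \\
C_1 &= A_1 \circ_2 B_1 = B_1 \circ_2 A_0 = \langle \bar{x}_1 + x_3,\ x_2 \rangle, \\
C_2 &= A_1 \circ_3 B_1 = \langle x_1 + \bar{x}_2,\ x_2 x_3 \rangle, \\
C_3 &= B_1 \circ_1 A_1 = \langle \bar{x}_2 + \bar{x}_3,\ \bar{x}_1 x_2 \rangle.
\end{aligned}
\]

Finally, we can use our nine prime P-functions of orders 0, 1, and 2 in order to
obtain the single prime P-function of order 3 at the top of the lattice,
$D_0 = \langle 1, f \rangle$:

\[
\begin{aligned}
D_0 &= A_1 \circ_2 C_0 = C_2 \circ_2 C_0 = C_3 \circ_2 C_0 \\
    &= C_1 \circ_1 C_2 = B_1 \circ_1 C_2 = C_3 \circ_3 C_1 \\
    &= C_3 \circ_3 B_1.
\end{aligned}
\]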

The corresponding lattice is shown in Figure 9, where a number, $i$, in a circle
has been used to denote a composition with respect to the $i$th variable. A search
through the lattice shows that the optimal diagrams have three nodes for a storage
cost of 6; the diagram of Figure 2 was one of those. □

Figure 9. The lattice of prime P-functions of Example 11.

3.3  Questions  of  Optimality 

Since decision tree optimization is an NP-hard problem under polynomial-size inputs
[More80b], it is necessary to develop some heuristics that will allow the fast
construction of good, albeit not optimal, solutions. Indeed, some such heuristics
were proposed even before an optimal algorithm was developed; it appears that the
so-called "splitting" heuristic, discussed below, was known to Aristoteles and
Theophrastus [Voss52], and many heuristics were proposed for the conversion of
decision tables [Mont62, Egle63, Poll65] before the publication of the
branch-and-bound solution of Rein66.

All published heuristics are of the so-called greedy type, that is, they perform a
local, step-by-step optimization. Three main types of criteria are used; the
apparently large variety results from attempts to accommodate tests with variable
outcomes (as in most biological applications [Brow77, Gowe75, PayR81]), or from
minor differences in preprocessing (such as attempts to put decision tables in a
canonical form [Shwa75]), or in the extent of look-ahead used (while most
strategies use no look-ahead at all, Seth80 has suggested a one-step look-ahead and
Mich78 proposed a range of look-aheads, from 0 to $k$ steps). These three criteria
are discussed in some detail below. (Since the discussion of the first, and most
important, of these criteria involves some mathematical manipulations, we once
again have organized its discussion in two parts, grouping all mathematical
concepts in the second.)

3.3.1  The  Information-Theoretic  Criterion

In  this  strategy,  commonly  used  for  the 
(near)  minimization  of  the  expected  testing
cost,  the  problem  is  viewed  as  one  of  refin¬ 
ing  an  initial  uncertainty  about  the  func¬ 
tion’s  value  into  a  certitude.  At  each  step, 
the  test  of  a  variable  diminishes  the  uni¬ 
verse  of  possibilities,  thereby  removing  a 
certain  amount  of  ambiguity.  In  informa¬ 
tion-theoretic  terms,  the  initial  ambiguity 
of  a  (partial)  function  is  expressed  by  the 
entropy  of  the  function  (for  a  lucid  exposi¬ 
tion  on  the  topic,  see  the  original  paper  of 
Shan48).  The  ambiguity  remaining  after 
testing  a  variable  can  be  computed  as  the 
average  ambiguity  among  the  restrictions. 
This  allows  the  computation  of  the  ambi¬ 
guity  removed  (or,  equivalently,  the  infor¬ 
mation  gained)  by  testing  that  variable. 
The  information  heuristic  then  chooses  at 
each  step  that  variable  which  removes  the 
most  ambiguity  per  unit  testing  cost. 

The previously mentioned splitting heuristic is a special case of the information
heuristic for identification problems. It is well known [Shan48] that the removed
ambiguity is maximized by letting all restrictions have equal probability; thus,
when all variable costs are unity, the information heuristic chooses that variable
which "splits" the set of objects into subsets with most nearly equal
probabilities. In the case of binary identification problems with equally likely
objects and unity costs, this means selecting that test which splits the objects
into subsets of most nearly equal size. Yet another aspect of the splitting
heuristic is a criterion used for binary identification problems [Gyll63, Chan65],
which selects that variable which separates the largest number of pairs of values:
if $n$ is the total number of values and $k$ the number of values put into one
subset, then $k \cdot (n - k)$ pairs are split. This criterion is optimized for
$k = n/2$. (Choosing that variable which separates the largest number of pairs
could be described as the separation heuristic; this criterion also appears in a
variety of forms, some of which attempt to include tests with variable outcomes
[Brow77, PayR80].)
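As a small illustration of the separation criterion for binary tests, the sketch below counts the $k \cdot (n - k)$ pairs of objects that a test separates and picks the test with the largest count. The function names and the sample answer table are hypothetical, made up only for this example.

```python
def pairs_separated(answers):
    """Number of object pairs split by a binary test.

    answers[j] is the 0/1 outcome of the test on object j; with k ones
    among n objects, the test separates k * (n - k) pairs.
    """
    k = sum(answers)
    return k * (len(answers) - k)


def best_separating_test(tests):
    """Name of the test that separates the most pairs of objects."""
    return max(tests, key=lambda name: pairs_separated(tests[name]))


# Hypothetical answer table: three tests applied to six objects.
TESTS = {
    "t1": [1, 0, 0, 0, 0, 0],
    "t2": [1, 1, 1, 0, 0, 0],
    "t3": [1, 1, 0, 0, 0, 0],
}

if __name__ == "__main__":
    for name, answers in TESTS.items():
        print(name, pairs_separated(answers))
    print("chosen:", best_separating_test(TESTS))
```

The even split `t2` wins here, in line with the observation that the count is largest when $k = n/2$.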

Descriptions of the heuristic and its variants abound [Klet60, Resc61, Osbo63,
Mand64, Gowe71, Gana73, Shwa74], but it was not until later that the performance
of the information heuristic was analyzed. Garey [Gare74] studied its application
to identification problems and showed that, although it is quasi-optimal when all
possible tests are available (a result dating from Zimm59), there are problems for
which it can construct trees with an expected testing cost arbitrarily larger than
the optimal. This result disproved a long-standing conjecture that the splitting
algorithm was optimal for identification problems with equally likely objects
[Klet60, Osbo63]; such a result is, of course, predictable now in view of the
NP-hardness of the problem since an optimal polynomial algorithm would disprove the
widely held opinion that NP-hard problems actually require exponential solutions.
However, Hung74 proved that the heuristic is, on the average, asymptotically
optimal; that is, the ratio of the average cost of trees built with the information
heuristic to that of the optimal trees converges to 1 as the problems get larger.
This result must be qualified by the observation that most functions have an
expected testing cost fairly close to maximum, so that the asymptotic ratio used in
Hung74 is in general fairly small for any heuristic. Indeed, More81b showed that
completely specified Boolean functions of $n$ variables with unity costs and
uniform probability distribution have an asymptotic expected testing cost of
$n - 1$, so that in this case the asymptotic average ratio must be 1 for any
heuristic. Recently, Hart82 presented a generalization of the information heuristic
which allows for more complex objective functions (including the acquisition cost)
and offers a trade-off between the complexity of the construction and the upper
bounds that can be placed on the size of the resulting solution.

The information heuristic is efficient: for an input size of $O(S)$, it takes time
$O(S \cdot \log S)$ which, while barely faster than the optimal dynamic programming
solution for exponential size inputs, compares very favorably indeed with the
exponential-time algorithms used for identification problems.

We  now  examine  the  entropy  computa¬ 
tions  in  some  detail;  once  again — and  for 
the  last  time— the  less  mathematically  in¬ 
clined  reader  may  wish  to  skip  to  the  next 
section. 

The expression for the entropy of a function $f$ is [Shan48]

\[ H_f = -\sum_{v} p(f = v) \cdot \log_2 p(f = v), \tag{7} \]

where $p(f = v)$ is the probability that $f$ takes the value $v$ and the sum is
taken over all the values $v$ in the range of $f$ (values of $p(f = v)$ are
normalized so that their sum over the values of $v$ is equal to 1). After testing
variable $x_i$, the remaining ambiguity, $H_f(x_i)$, is the average ambiguity among
the restrictions:

\[ H_f(x_i) = \sum_{j=0}^{m_i-1} \left[ p(x_i = j) \cdot H_{f|_{x_i=j}} \right]. \]

Hence the ambiguity removed by testing $x_i$ is the quantity

\[ I_f(x_i) = H_f - H_f(x_i). \tag{8} \]

This quantity is computed for each variable; the information heuristic then chooses
at each step that variable which has the greatest ratio, $I_f(x_i)/t_i$.

For identification problems, the expression for the removed ambiguity can be
simplified to

\[ I_f(x_i) = -\sum_{j=0}^{m_i-1} p(x_i = j) \cdot \log_2 p(x_i = j), \tag{9} \]

which is seen to be of the same form as (7). Thus maximizing the removed ambiguity
in an identification problem involves finding that variable $x_i$ such that the
probabilities $p(x_i = j)$ are most nearly equal to each other.

Example  12 

Let us use once more the function of Example 7. The entropy of the function is

\[
\begin{aligned}
H_f &= -[\,p(f = 0) \cdot \log_2 p(f = 0) + p(f = 1) \cdot \log_2 p(f = 1)\,] \\
    &= -[\,0.6 \cdot \log_2 0.6 + 0.4 \cdot \log_2 0.4\,] \approx 0.971.
\end{aligned}
\]

The ambiguity remaining after testing $x_1$ is

\[
\begin{aligned}
H_f(x_1) &= p(x_1 = 0) \cdot H_{f|_{x_1=0}} + p(x_1 = 1) \cdot H_{f|_{x_1=1}} \\
         &\approx 0.5 \cdot 1 + 0.5 \cdot 0.881 \approx 0.941.
\end{aligned}
\]

Thus the information gain of $x_1$ per unit testing cost is

\[ \frac{I_f(x_1)}{t(x_1)} \approx \frac{0.971 - 0.941}{1} \approx 0.030. \]

Similarly, we find

\[
\frac{I_f(x_2)}{t(x_2)} \approx \frac{0.971 - 0.625}{2} \approx 0.173,
\qquad
\frac{I_f(x_3)}{t(x_3)} \approx \frac{0.971 - 0.652}{6} \approx 0.053,
\]

so that the heuristic will choose to test $x_2$ first. Since the restriction for
$x_2 = 0$ is constant, only the restriction for $x_2 = 1$ remains. The information
gains per unit cost of $x_1$ and $x_3$ for that restriction are

\[
\frac{I_{f|_{x_2=1}}(x_1)}{t(x_1)} \approx \frac{0.961 - 0.382}{1} \approx 0.579,
\qquad
\frac{I_{f|_{x_2=1}}(x_3)}{t(x_3)} \approx \frac{0.961 - 0.195}{6} \approx 0.128,
\]

so that $x_1$ is tested next. The completed tree is in this case the optimal tree
developed previously. □
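The entropy computations of equations (7) and (8) translate into a short greedy selection routine. The sketch below is illustrative only and rests on assumptions: the function $f = x_2(\bar{x}_1 + x_3)$ and the point probabilities are inferred from the figures quoted in Examples 10 and 12, not copied from Example 7, and the split of 0.25 between the points (0,0,0) and (0,0,1) is arbitrary (it affects the figure for $x_3$ but not the figures for $x_1$ and $x_2$). This sketch renormalizes probabilities inside each restriction, so second-stage intermediate values may differ from those quoted depending on the normalization convention.

```python
from math import log2


def entropy(dist):
    """Entropy (7) of a value -> probability mass table, after normalization."""
    total = sum(dist.values())
    return -sum((p / total) * log2(p / total) for p in dist.values() if p > 0)


def value_distribution(f, prob, fixed):
    """Probability mass of each value of f on the points matching `fixed`."""
    dist = {}
    for pt, p in prob.items():
        if all(pt[i] == v for i, v in fixed.items()):
            dist[f[pt]] = dist.get(f[pt], 0.0) + p
    return dist


def remaining_ambiguity(f, prob, i, fixed=None, k=2):
    """Average entropy of the restrictions obtained by testing variable i."""
    fixed = dict(fixed or {})
    base = sum(value_distribution(f, prob, fixed).values())
    h = 0.0
    for v in range(k):
        dist = value_distribution(f, prob, {**fixed, i: v})
        mass = sum(dist.values())
        if mass > 0:
            h += (mass / base) * entropy(dist)
    return h


def best_variable(f, prob, cost, fixed=None, k=2):
    """Untested variable with the greatest information gain per unit cost."""
    fixed = dict(fixed or {})
    h_f = entropy(value_distribution(f, prob, fixed))
    free = [i for i in range(len(cost)) if i not in fixed]
    return max(
        free,
        key=lambda i: (h_f - remaining_ambiguity(f, prob, i, fixed, k)) / cost[i],
    )


# Assumed data (see the lead-in).
COSTS = (1.0, 2.0, 6.0)
PROB = {
    (0, 0, 0): 0.125, (0, 0, 1): 0.125,  # arbitrary split of 0.25
    (0, 1, 0): 0.05, (0, 1, 1): 0.20,
    (1, 0, 0): 0.05, (1, 0, 1): 0.05,
    (1, 1, 0): 0.25, (1, 1, 1): 0.15,
}
F = {pt: int(pt[1] == 1 and (pt[0] == 0 or pt[2] == 1)) for pt in PROB}

H_F = entropy(value_distribution(F, PROB, {}))
H_F_X1 = remaining_ambiguity(F, PROB, 0)
H_F_X2 = remaining_ambiguity(F, PROB, 1)
```

Calling `best_variable(F, PROB, COSTS)` and then `best_variable(F, PROB, COSTS, {1: 1})` replays the two selection steps of the example under these assumptions.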

3.3.2  The  Activity  Criterion 
A class of simple heuristics can be obtained for any problem by using the bounding
functions of branch-and-bound algorithms and doing local optimization on their
basis, in effect using branch-and-bound without backtracking. This approach is of
no particular interest for the minimization of storage cost (although it has been
used for that purpose [Rabi71, Yasu71]) since the existing bounds are too loose; it
is, however, applicable to the minimization of the expected testing cost, since the
lower bound based on activity is in general tighter.

The  activity  heuristic  has  the  same  com¬ 
putational  requirements  as  the  information 
heuristic.  Moret  [MoRE80b]  provided  an 
analysis  of  its  performance,  showing  that 
the  worst-case  ratio  for  completely  speci¬ 
fied  Boolean  functions  with  unity  costs  and 
uniform  probabilities  is  limited  to  2,  but 
can  be  arbitrarily  large  if  nonuniform  prob¬ 
abilities  are  allowed.  This  heuristic  appears 
to  be  of  less  interest  than  the  information 
heuristic,  since  it  performs  best  for  "dense” 
problems,  that  is,  those  in  which  the  func¬ 
tion  is  specified  on  most  of  its  domain, 
which  are  precisely  those  problems  that  can 
be  efficiently  solved  by  the  optimal  dy¬ 
namic  programming  algorithm. 

Finally, a similar approach has been taken by some authors for biological
identification problems, using rough lower and upper bounds on the number of tests
needed to complete an identification (see the thorough studies of Brow77 and
PayR81). In one such approach, it is postulated that the subtree will be completed
optimally (i.e., following Huffman's procedure, even though only a small proportion
of all tests is available); the resulting lower bound is used for deriving a
selection criterion [Dall74, Brow77]. Alternatively, it is assumed that the tree
will be completed by a linear sequence of simple tests; the resulting estimate is
an upper bound under most conditions and allows the derivation of another selection
criterion [PayR81]. However, those criteria are based on rather simplistic bounds
and thus susceptible to large errors; despite the lack of either theoretical or
practical results about their performance, one can safely predict that their
average performance is worse than that of the other criteria examined so far.

3.3.3 Ad Hoc Criteria for Boolean Functions

Since  Boolean  functions  are  conveniently 
expressed  by  formulas,  special  heuristics 
can  be  developed  that  are  based  on  char¬ 
acteristics  of  the  formulas  such  as  number 
of  terms  or  literals. 

One  such  heuristic,  proposed  by  Halp74 
for  the  minimization  of  the  expected  testing 


cost,  necessitates  the  generation  of  all 
prime  implicants  for  the  function  and  its 
dual;  the  implicants  are  then  ranked  in 
terms of their probability-to-cost* ratio.
Variables  which  appear  in  both  the  best 
implicant  for  the  function  and  that  for  its 
dual  are  then  selected  (at  least  one  such 
variable must exist). Halpern [Halp74]
proved  that  this  strategy  is  optimal  for  sym¬ 
metric  functions  (those  that  remain  invar¬ 
iant  under  any  permutation  of  the  vari¬ 
ables),  but  offers  no  analysis  of  performance 
in  the  general  case. 

Breitbart [Brei75a] presented a similar
heuristic  for  monotone  Boolean  functions 
with  unity  costs  and  uniform  probability 
distribution,  which  uses  the  minimal  dis¬ 
junctive  form  of  the  function  (this  form  is 
unique  for  monotone  functions  [Harr65]); 
in  a  later  analysis  [Brei78],  it  was  shown 
that  trees  constructed  by  this  rule  can 
have  an  expected  number  of  tests  at  least 
(n/log n) times larger than the optimal
trees  for  functions  of  n  variables. 

Both  heuristics  apply  only  to  completely 
specified  Boolean  functions  and  require  ex¬ 
ponential  time  since  the  generation  of  all 
prime  implicants  and/or  the  minimization 
of the disjunctive form are NP-hard problems
[Mase82]. Since dynamic programming offers
an $O(S^{\log_2 3} \cdot \log S)$ optimal solution
to the same problem, these heuristics
are  of  interest  only  when  the  function  is 
already  specified  by  its  prime  implicants  or 
its  minimal  disjunctive  form. 

4.  APPLICATIONS 

In  this  section,  we  describe  the  main  fields 
of  application  and  review  related  results. 
We  distinguish  four  fields:  (1)  decision  table 
programming;  (2)  diagnosis,  identification, 
and  pattern  recognition;  (3)  logic  and  pro¬ 
gram  design;  and  (4)  analysis  of  algorithms. 
Of  these,  only  the  last  three  are  treated, 
since  decision  table  programming  is  chiefly 
concerned  with  the  construction  of  optimal 
decision  trees,  a  topic  with  which  we  dealt 
in  the  previous  section. 


*  The  cost  of  an  implicant  is  that  of  the  optimal  tree 
for  it;  that  tree  is  easily  constructed  [Ries63,  Slag64] 
since  tree  representations  of  conjunctions  of  variables 
are just linear test sequences.

4.1 Diagnosis, Identification, and Pattern
Recognition

Garey [Gare72a] argues that most identification
problems of human origin include
a  large  number  of  simple  tests  (of  the  type, 
“Is  the  unknown  object  of  type  i?”)  plus  a 
smaller  number  of  “well-splitting”  tests.  In¬ 
deed,  this  is  how  most  of  us  approach  the 
typical  identification  problem  of  “twenty 
questions,”  starting  with  general,  well-split¬ 
ting  questions  (e.g.,  "Is  it  mineral?")  and 
ending  the  game  with  simple  questions  (e.g., 
“Is  it  an  aardvark?”).  Several  large  identi¬ 
fication  problems  approximate  this  descrip¬ 
tion,  notably  in  botanical  and  biological 
classification  [Moll62,  Pank70,  Mors71, 
Will80]. Garey [Gare72a] has described a
dynamic  programming  algorithm  especially 
designed  for  this  type  of  problems  that  con¬ 
structs  identification  trees  with  minimum 
expected  testing  cost. 

In  most  identification  problems,  many 
more  tests  are  present  than  are  needed  for 
the  identification  of  all  objects.  In  conse¬ 
quence,  several  researchers  have  studied 
the  problem  of  obtaining  a  minimal  set  of 
tests;  we  recognize  in  this  problem  the  min¬ 
imization  of  the  total  acquisition  cost.  Such 
an optimization is important when individual
tests are time consuming and promptness
in identification essential (as in medical
diagnosis [PayR80, Will80]), so that
parallel,  rather  than  sequential,  testing  is 
used.  Unfortunately,  the  minimum  test  set 
problem,  as  it  is  known,  is  itself  NP-hard 
[Gare79]. As a result, splitting heuristics
based  on  the  number  of  split  pairs  have 
been  proposed  for  the  construction  of  sub- 
optimal  solutions  [Gyll63,  Chan65];  how¬ 
ever,  no  performance  analysis  was  supplied. 
(The  general  analysis  of  the  splitting  algo¬ 
rithm  presented  in  the  previous  section 
does  not  apply  here  since  the  goals  are  quite 
distinct.)  Some  preliminary  analytical  re¬ 
sults  as  well  as  extensive  experimental  data 
can  be  found  in  More82.  This  problem  is 
closely  related  to  that  of  finding  prime  im- 
plicants  for  functions  of  Boolean  variables, 
since  each  minimal  set  of  prime  implicants 
determines  an  irredundant  set  of  tests  for 
the  problem.  Hence  methods  have  been 
proposed  that  first  find  all  prime  implicants 
and  then  attempt  to  find  a  set  of  prime 
implicants that minimizes the number of
distinct variables used (see the review of
PayR80, pp. 261-263).
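The splitting heuristic based on the number of split pairs can be sketched as follows. This is only an illustration of the general idea, not the procedure of Gyll63 or Chan65: the function name `greedy_test_set`, the data layout (each object mapped to a tuple of test outcomes), and the sample objects are assumptions made for the example. At each step the test that separates the largest number of still-unseparated object pairs is added to the test set.

```python
from itertools import combinations

def greedy_test_set(objects):
    """Greedy sketch: objects maps a name to a tuple of test outcomes.
    Returns the indices of the chosen tests (not guaranteed minimal)."""
    names = list(objects)
    n_tests = len(next(iter(objects.values())))
    unsplit = set(combinations(names, 2))   # pairs not yet distinguished
    chosen = []
    while unsplit:
        def gain(t):
            # number of remaining pairs that test t would separate
            return sum(1 for a, b in unsplit if objects[a][t] != objects[b][t])
        candidates = [t for t in range(n_tests) if t not in chosen]
        best = max(candidates, key=gain, default=None)
        if best is None or gain(best) == 0:
            raise ValueError("some objects cannot be distinguished")
        chosen.append(best)
        unsplit = {(a, b) for a, b in unsplit
                   if objects[a][best] == objects[b][best]}
    return chosen

# Hypothetical data: tests 0-2 are simple tests ("is it A?", ...),
# tests 3 and 4 are well-splitting tests.
objects = {
    "A": (1, 0, 0, 0, 0),
    "B": (0, 1, 0, 0, 1),
    "C": (0, 0, 1, 1, 0),
    "D": (0, 0, 0, 1, 1),
}
print(greedy_test_set(objects))
```

On this small instance the two well-splitting tests suffice; in general the greedy choice carries no optimality guarantee, which is consistent with the NP-hardness of the minimum test set problem.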

Since  decision  trees  model  multistage 
branching  decision  processes,  they  find  a 
particularly  important  application  in  se¬ 
quential,  or  more  precisely,  hierarchical, 
pattern recognition (see Kana79 for some
general  considerations  about  the  advan¬ 
tages  of  hierarchical  approaches).  In  the 
simplest  case,  a  pattern  recognition  prob¬ 
lem  is  deterministic  and  reduces  to  an  iden¬ 
tification  problem.  In  general,  however,  a 
type  (called  a  class)  of  objects  is  not  abso¬ 
lutely  characterized  by  selected  combina¬ 
tions  of  test  values  (called  features );  rather, 
the  problem  is  of  a  statistical  nature  such 
that  each  combination  of  features  is  distrib¬ 
uted  among  all  the  classes.  (This  model  of 
probabilistic  identification  also  corresponds 
to  the  fuzzy  decision  tables  discussed  in 
Kand80.)  If  the  set  of  features  has  suffi¬ 
cient  power  of  discrimination,  the  probabil¬ 
ity  distribution  of  each  combination  of  fea¬ 
tures  will  exhibit  one  strong  peak  for  some 
class,  so  that  an  object  possessing  this  com¬ 
bination  of  features  can  be  classified  in  that 
class  with  a  low  probability  of  error.  At 
times,  however,  it  may  be  advantageous  to 
trade  accuracy  for  speed  and  allow  an  ob¬ 
ject  to  be  classified  in  a  patently  wrong 
class  in  order  to  gain  on  response  time.  As 
a  consequence  of  this  additional  freedom, 
decision trees for pattern recognition purposes
are subject to yet another optimization
criterion: minimum overall probability
of misclassification.

Example  13 

Consider  the  following  simplified  pattern 
recognition problem with three classes, C1,
C2, C3, and three binary features, x1, x2, x3.
The testing costs of the three variables
(features) are

t: x1 → 1   x2 → 2   x3 → 3,

and  the  probability  distribution  on  the  vari¬ 
ables'  space  is  given  by 

Pv: (0, 0, 0) → 0.10   (1, 0, 0) → 0.20
    (0, 0, 1) → 0.20   (1, 0, 1) → 0.10
    (0, 1, 0) → 0.10   (1, 1, 0) → 0.10
    (0, 1, 1) → 0.05   (1, 1, 1) → 0.15

The distribution of each combination of
features among the classes is given below,
where the probability that a given combination
corresponds to class i is given by the
ith value of the triple:

Pc: (0, 0, 0) → (0.10, 0.85, 0.05)
    (0, 0, 1) → (0.01, 0.98, 0.01)
    (0, 1, 0) → (0.20, 0.10, 0.70)
    (0, 1, 1) → (0.80, 0.10, 0.10)
    (1, 0, 0) → (0.80, 0.10, 0.10)
    (1, 0, 1) → (0.90, 0.05, 0.05)
    (1, 1, 0) → (0.05, 0.05, 0.90)
    (1, 1, 1) → (0.10, 0.00, 0.90)

The  two  distributions  allow  us  to  compute 
the  a  priori  probability  of  each  class,  that 
is, the probability that an unknown object
belongs  to  that  class: 

p(C1) = 0.342
p(C2) = 0.326
p(C3) = 0.332

Since  each  combination  of  features  must  be 
classified  in  some  class,  the  strategy  that 
minimizes  the  probability  of  misclassifica- 
tion  is  obviously  to  classify  a  combination 
of  features  in  the  class  for  which  it  shows 
the  largest  probability;  thus  we  get  the  as¬ 
signment 

f: (0, 0, 0) → C2   (1, 0, 0) → C1
   (0, 0, 1) → C2   (1, 0, 1) → C1
   (0, 1, 0) → C3   (1, 1, 0) → C3
   (0, 1, 1) → C1   (1, 1, 1) → C3

Hence  the  probability  of  misclassification 
of an object with the combination of features
(0, 0, 0) is 0.10 + 0.05 = 0.15; since
that combination of features occurs with a
probability of 0.10, it contributes a total of
0.10 × 0.15 = 0.015 to the overall minimal
probability of misclassification. Working
similarly with the other combinations
of features, the latter probability is found
to be

$$p_{\min} = 0.134.$$

A  decision  tree  with  that  overall  probability 
of  misclassification  is  shown  in  Figure  10a, 
together  with  the  probability  of  its  leaves; 
this  tree  has  an  expected  testing  cost  of  4. 
A  different  tree  is  pictured  in  Figure  10b; 
this  tree  classifies  the  combination  (0, 1,  1) 
in class C3, clearly not an optimal choice
in  terms  of  classification  accuracy.  How¬ 
ever,  this  tree  has  an  expected  testing  cost 


Figure 10. The two decision trees for the pattern
recognition  problem  of  Example  13:  (a)  with  minimum 
possibility  of  error;  (b)  with  trade-off  for  more  effi¬ 
ciency. 


of  only  2.6,  and  its  overall  probability  of 
misclassification  is  found  to  be  0.164,  barely 
larger  than  optimal.  Thus  it  is  a  difficult 
task  to  decide  which  tree  is  best;  other 
criteria  must  be  used,  such  as  penalties  due 
to  misclassification  or  maximum  permissi¬ 
ble  response  time.  □ 
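The arithmetic of Example 13 can be checked with a short script. The data below are transcribed from the example; the tuple encoding of trees, the names `TREE_A` and `TREE_B`, and the trees themselves are assumptions, since Figure 10 is not reproduced here. The two trees were chosen to be consistent with the stated expected testing costs (4 and 2.6): `TREE_A` tests x2, then x1 or x3 (and x1 once more when x3 = 1), while `TREE_B` tests x2 and classifies every object with x2 = 1 in C3.

```python
# Data from Example 13; variables are 0-indexed (x1 -> 0, x2 -> 1, x3 -> 2).
COST = (1, 2, 3)
P_V = {(0, 0, 0): 0.10, (0, 0, 1): 0.20, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
       (1, 0, 0): 0.20, (1, 0, 1): 0.10, (1, 1, 0): 0.10, (1, 1, 1): 0.15}
P_C = {(0, 0, 0): (0.10, 0.85, 0.05), (0, 0, 1): (0.01, 0.98, 0.01),
       (0, 1, 0): (0.20, 0.10, 0.70), (0, 1, 1): (0.80, 0.10, 0.10),
       (1, 0, 0): (0.80, 0.10, 0.10), (1, 0, 1): (0.90, 0.05, 0.05),
       (1, 1, 0): (0.05, 0.05, 0.90), (1, 1, 1): (0.10, 0.00, 0.90)}

# A priori class probabilities p(C1), p(C2), p(C3).
priors = [sum(P_V[v] * P_C[v][i] for v in P_V) for i in range(3)]

# Assign each combination to its most probable class (labels 1..3).
assignment = {v: 1 + max(range(3), key=lambda i: P_C[v][i]) for v in P_V}

# Overall minimal probability of misclassification.
p_min = sum(P_V[v] * (1 - max(P_C[v])) for v in P_V)

def expected_cost(tree):
    """Expected testing cost of a tree given as a class label (int)
    or a tuple (variable index, subtree for 0, subtree for 1)."""
    total = 0.0
    for v, p in P_V.items():
        node, cost = tree, 0
        while isinstance(node, tuple):
            var, low, high = node
            cost += COST[var]
            node = high if v[var] else low
        total += p * cost
    return total

# Assumed trees, consistent with the costs quoted in the text.
TREE_A = (1, (0, 2, 1), (2, 3, (0, 1, 3)))
TREE_B = (1, (0, 2, 1), 3)

print([round(p, 3) for p in priors], round(p_min, 3))
print(round(expected_cost(TREE_A), 3), round(expected_cost(TREE_B), 3))
```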

Faced  with  such  a  variety  of  design  cri¬ 
teria,  researchers  in  the  field  have  explored 
different  routes.  A  tree  is  often  synthesized 
directly  from  the  problem  without  attempt 
to  optimize  its  testing  cost,  but  using  heu¬ 
ristics  designed  to  minimize  the  probability 
of  misclassification  [You76,  Roun79]. 
Hauska  [Haus75]  and  Wu  [Wu75]  de¬ 
scribed  a  local  optimization  algorithm, 
based  on  measures  of  interclass  separation, 
for  the  semiautomatic  design  of  a  decision 
tree  from  a  known  set  of  features.  When 
the  features  are  known  in  advance,  a  pro¬ 
cedure  based  on  dynamic  programming  due 
to  Datt81  can  be  used  to  build  a  decision 
tree  with  minimal  cost,  in  which  the  cost 
criterion  includes  the  expected  testing  cost 
and  the  cost  of  misclassification;  a  similar 
approach  based  on  game  theory  was  de¬ 
scribed  in  Slag71  and  a  third  in  Kulb76. 
When  the  class  assignments  are  made  (and 
the  probability  of  misclassification  there¬ 
fore  fixed),  the  pattern  recognition  problem 
reduces  to  the  description  of  a  (partial) 
function;  Bell78  models  this  case  by  deci¬ 
sion  tables  and  discusses  their  conversion 
to  sequential  testing  procedures  (although 
the  optimal  algorithm  of  Baye73,  PayH77 
is  not  mentioned).  The  accuracy  of  decision 
tree classifiers depends upon such considerations
as sample statistics (e.g., sample
size),  and  the  number  and  intercorrelation 
of  features;  Kulk78  studied  the  problem 
under simplified assumptions and concluded
that hierarchical classifiers, like their
single-stage  counterparts,  may  suffer  from 
the  “dimensionality”  problem,  that  is,  show 
decreasing  performance  if  the  number  of 
features  is  increased  beyond  a  certain 
threshold  (which  depends  on  the  sample 
size). 

4.2  Logic  and  Program  Design 

We  mentioned  previously  that  decision 
trees  and  diagrams  for  Boolean  functions 
naturally  give  rise  to  multiplexer  imple¬ 
mentations.  Such  implementations  are  at¬ 
tractive  since  the  resulting  circuits  have  few 
interconnections  and  lend  themselves  well 
to  large-scale  integration  (for  instance,  bi¬ 
nary  trees  form  an  efficient  interconnection 
pattern [Hono81]). Moreover, multiplexer
networks  can  be  used  as  universal  logic 
modules  (ULMs)  [Tabl76,  Volt77],  there¬ 
by  reducing  the  number  of  basic  compo¬ 
nents  needed  for  logic  design. 

Decision  tree  representations  of  Boolean 
functions  exhibit  several  advantages  over 
Boolean  formulas.  Lee  [Lee59]  showed 
that at most $O(2^n/n)$ diagram nodes are
required to represent any Boolean function
of n variables, which compares very favorably
with the $O(2^n/\log n)$ operators that
may be needed by an unfactored Boolean
formula [Sava76]. Moreover, every operator
of the Boolean formula must be carried
out in order to evaluate the formula, so that
up to $O(2^n/\log n)$ operations may be performed,
while a decision diagram will never
require  more  than  n  variable  evaluations. 
Thus  decision  diagrams  express  Boolean 
functions  at  least  as  compactly  as  Boolean 
formulas  and  are  greatly  more  efficient  as 
an  evaluation  tool.  (The  latter  property  is 
used,  e.g.,  for  the  repeated  evaluation  of 
Boolean  queries  in  a  large  database 
[Wong76].)  Finally,  decision  diagrams  lend 
themselves  to  composition  and  recursion, 
as is shown in the next section, while the
same  operations  are  very  difficult  to  carry 
out  using  formulas.  Some  of  those  advan¬ 
tages  were  rediscovered  and  some  pointed 
out  for  the  first  time  by  Aker78,  who  in¬ 


vestigated  the  use  of  binary  decision  dia¬ 
grams  (with  a  slightly  different  definition) 
in  testing  digital  systems. 
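The bound of n variable evaluations can be illustrated with a minimal evaluator. The tuple encoding of trees and the name `MAJ3` are assumptions made for this sketch; the function represented is the majority of three Boolean variables, a symmetric function chosen only as an example.

```python
from itertools import product

def evaluate(tree, values):
    """Follow one root-to-leaf path; return (function value, tests made).
    A tree is a leaf value (int) or (variable index, low subtree, high subtree)."""
    tests = 0
    node = tree
    while isinstance(node, tuple):
        var, low, high = node
        tests += 1                      # one variable evaluation per node visited
        node = high if values[var] else low
    return node, tests

# Majority of x1, x2, x3 (indices 0, 1, 2).
MAJ3 = (0, (1, 0, (2, 0, 1)), (1, (2, 0, 1), 1))

for bits in product((0, 1), repeat=3):
    value, tests = evaluate(MAJ3, bits)
    assert value == int(sum(bits) >= 2)
    assert tests <= 3                   # never more than n = 3 evaluations
```

Each evaluation visits a single path, so no variable is read more than once, whereas a formula evaluator must apply every operator.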

In  conjunction  with  the  evaluation  of 
Boolean  functions,  it  must  be  noted  that 
almost all Boolean functions have a pessimal
worst-case testing cost (i.e., all variables
are tested on at least one path). This
result, due to Rive76a, was later complemented
by More81b, who proved that all
symmetric  and  all  linearly  separable  (also 
called  threshold)  Boolean  functions  possess 
this  property.  An  important  consequence  of 
these  results  is  that  synchronous  multi¬ 
plexer  implementations  of  Boolean  func¬ 
tions  in  most  cases  cannot  be  optimized 
with  respect  to  their  propagation  delay, 
since  this  delay  is  determined  by  the  longest 
path  through  the  network. 

Finally, decision diagrams can be used to
model  the  control  structure  of  a  program 
(as  advocated  in  Prat78,  where  they  are 
called  “atomic  digraphs”),  in  particular,  in 
relation to Ianov’s schemata [Iano60]. It
then  becomes  important  to  recognize  iden¬ 
tical  structures,  that  is,  to  decide  whether 
or  not  two  decision  diagrams  are  equiva¬ 
lent.  This  problem  is  known  to  be  NP-hard 
[Fort78];  however,  Blum80  described  an 
efficient algorithm that solves the equivalence
problem probabilistically (i.e., which
provides an answer correct within a given,
and refinable, percentage of error).

4.3  Analysis  of  Algorithms 

The  worst-case  number  of  tests  (the  height) 
of  a  decision  tree  indicates  a  minimum 
number  of  argument  evaluations  that  must 
be  performed  in  order  to  compute  a  func¬ 
tion.  As  such,  finding  the  minimum  height 
of  any  tree  representation  of  a  function  is 
a  useful  technique  for  deriving  bounds  to 
be  used  in  the  worst-case  analysis  of  algo¬ 
rithms.  The  previously  mentioned  results 
of Rive76a and More81b, showing that
large classes of Boolean functions require
any decision tree representation to have
maximal height, are an example of such
analysis; indeed, Rive76b built upon these
results  to  prove  some  lower  bounds  on  the 
complexity  of  graph  algorithms  based  upon 
a matrix representation.

The worst-case number of evaluations
must  be  at  least  equal  to  the  ratio  of  the 
initial  ambiguity  of  the  function  to  the  up¬ 
per  bound  on  the  information  supplied  by 
each  evaluation.  Since  the  evaluation  of  a 
A-ary  variable  provides  at  most  log2  A  bits 
of  information,  the  height  of  any  tree  for  a 
function, f, of A-ary variables, must obey
the relation

$$h_{\min} \geq H_f / \log_2 A. \qquad (10)$$

This  relation  has  been  used  to  provide 
lower  bounds  on  the  complexity  of  several 
combinational  problems,  such  as  sorting 
[Knut71]  and  various  set  operations 
[ReiE72].  The  decision  tree  approach  has 
recently  been  generalized  to  handle  proba¬ 
bilistic,  nondeterministic,  and  alternating 
models  of  complexity  [Manb82]. 
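Relation (10) is easy to evaluate numerically. The sketch below assumes that the initial ambiguity H_f is the base-2 logarithm of the number of distinct results the function can take (for sorting, the n! possible orderings); that reading, and the name `height_lower_bound`, are assumptions of this example. Integer arithmetic is used so that the ceiling is computed exactly.

```python
import math

def height_lower_bound(outcomes, arity=2):
    """Smallest h with arity**h >= outcomes, i.e. ceil(log_arity(outcomes)).
    A tree of height h whose tests have `arity` results has at most
    arity**h leaves, so it cannot separate more outcomes than that."""
    height, reachable = 0, 1
    while reachable < outcomes:
        reachable *= arity
        height += 1
    return height

# Sorting 5 keys with binary comparisons: 5! = 120 orderings to distinguish.
print(height_lower_bound(math.factorial(5)))
# The same problem with three-way comparisons (arity 3).
print(height_lower_bound(math.factorial(5), arity=3))
```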

5.  RECENT  DEVELOPMENTS 
5.1  Composition  and  Recursion 

The  decision  diagram  model  of  representa¬ 
tion  as  described  so  far  is  limited  to  (partial) 
functions.  Several  extensions  have  recently 
been  proposed  to  include  composition  of 
diagrams [Aker78, More80b] and modeling
of relations and simple recursion
[More80b].

Whereas  a  function  assigns  at  most  one 
value  to  each  combination  of  variables,  a 
relation  may  assign  any  number  of  values, 
that  is,  any  subset  of  the  set  of  values. 
Thus,  in  particular,  a  relation  accurately 
models  an  ambiguous  decision  table,  where 
the  same  rule  (the  same  combination  of 
variables)  may  specify  more  than  one  set  of 
actions.  In  accordance  with  King73,  we 
assume  that,  when  inconsistent  rules  apply, 
any  of  the  assigned  action  sets  may  actually 
be  chosen  for  execution.  In  terms  of  rela¬ 
tions,  a  decision  tree  implementation  can 
choose  to  specify  for  each  combination  of 
variables  any  (or  some,  or  all)  of  the  values 
from  the  subset  assigned  to  that  combina¬ 
tion  by  the  relation.  Since  combinations  of 
variables  for  which  the  function  (relation) 
is  not  defined  are  usually  allowed  to  take 
any  convenient  value,  we  may  assume  that 
such combinations are in fact related to the
whole  set  of  values.  Another  consequence 
of  our  assumptions  is  that  a  relation  is 
constant  exactly  when  the  intersection  of 


the  subsets  of  values  assigned  to  all  the 
combinations  of  its  variables  is  not  empty 
(for  one  can  choose  to  use  any  of  the  values 
present  in  the  intersection  for  each  of  the 
combinations  of  variables,  thereby  effec¬ 
tively  transforming  the  relation  into  a  con¬ 
stant  function).  Under  such  assumptions, 
all  of  the  results  discussed  so  far  apply  to 
the representation and evaluation of relations
[More80b].
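The constancy condition for relations amounts to a set intersection. In this sketch a relation is a dictionary mapping each combination of variables to the set of values it may take; the representation, the name `constant_values`, and the sample data are assumptions. An undefined combination is modeled, as in the text, by relating it to the whole set of values.

```python
def constant_values(relation):
    """Values that may be chosen for every combination of variables.
    The relation is constant exactly when this set is not empty."""
    value_sets = list(relation.values())
    return set.intersection(*value_sets) if value_sets else set()

ALL = {"a", "b", "c"}
relation = {
    (0, 0): {"a", "b"},
    (0, 1): {"b"},
    (1, 0): {"b", "c"},
    (1, 1): set(ALL),      # undefined combination: any value is acceptable
}
print(constant_values(relation))
```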

A  particularly  important  tool  in  the  anal¬ 
ysis  of  problems  is  decomposition,  which 
tries  to  simplify  a  problem  by  partitioning 
it  into  smaller  parts  (the  rationale  being 
that  the  complexity  of  the  whole  is  more 
than  the  sum  of  the  complexities  of  its 
parts);  “divide  and  conquer”  is  a  time-hon¬ 
ored  aspect  of  decomposition.  Conversely, 
composition  is  an  important  tool  in  synthe¬ 
sis.  We  have  seen  that  decision  trees  induce 
a  natural  decomposition — Shannon’s  de¬ 
composition;  thus  it  remains  to  demon¬ 
strate  how  to  compose  decision  trees  and 
diagrams.  Two  such  compositions  can  be 
distinguished:  leaf  composition  and  node 
composition. 

Leaf  composition  stems  from  the  simple 
hierarchical  idea  of  a  “tree  of  trees.”  As  an 
example,  consider  a  pattern  recognition 
problem  such  as  bird  identification.  To 
most  of  us,  identifying  a  bird  as  a  “sparrow” 
or  a  "dove”  is  sufficient;  to  a  bird  watcher 
or  an  ornithologist,  however,  this  is  a  vague 
classification  that  must  be  greatly  refined 
to  include  species  and  subspecies.  This  sug¬ 
gests  a  two-step  identification,  in  which  a 
rough  classification  is  first  made,  followed 
by  a  highly  specialized  procedure  for  fur¬ 
ther  refinements.  This  offers  several  advan¬ 
tages:  (1)  it  is  practical  since  it  can  be  of 
use  to  both  uninitiated  and  specialists;  (2) 
it  is  efficient  since  the  evaluation  can  be 
profitably  (for  the  uninitiated)  stopped 
after  the  first  stage;  and  (3)  it  can  be  greatly 
optimized  in  the  second  stage  since  it  is 
then  possible  to  design  highly  specialized 
tests.  The  first  step  in  such  a  design  consists 
of  a  single  decision  tree;  most  of  the  leaves 
of  the  tree,  however,  do  not  give  a  value, 
but  rather  designate  another  decision  tree 
to  be  used  in  the  second  step.  Thus  one  can 
stop  the  evaluation  upon  reaching  a  leaf  of 
the  first  tree,  taking  the  “name”  of  the 
second tree as the result of the evaluation,
or continue evaluation by proceeding to the
second  tree.  The  latter  choice  results  in  the 
composition  of  the  two  trees,  and  it  is  called 
leaf  composition  since  it  replaces  a  leaf  by 
another  tree. 

Formally,  then,  leaf  composition  of  two 
trees  is  the  process  of  attaching  the  second 
tree  in  place  of  appropriate  leaves  in  the 
first  tree.  (The  analog  in  the  software  world 
is  a  transfer  of  control  between  modules 
without  transfer  of  information,  such  as 
chaining.)  Clearly,  the  second  tree  cannot 
share  variables  with  the  first  lest  the  com¬ 
posed  tree  test  the  same  variable  twice  on 
some  path.  A  special  case  of  interest  is  the 
composition  of  decision  trees  for  Boolean 
functions,  where  the  second  tree  is  attached 
in  place  of  every  leaf  with  the  same  label  in 
the  first  tree.  One  easily  verifies  that  such 
a  composition  results  in  a  logical  OR  of  the 
two  functions  when  the  leaves  labeled  “0” 
are  replaced,  and  in  a  logical  AND  when 
the  leaves  labeled  “1”  are  used  instead. 
Moreover,  the  composition  is  then  com¬ 
mutative  (in  terms  of  the  function  it  yields), 
just  as  the  logical  operation  that  it  imple¬ 
ments.  Figure  11a  shows  two  trees,  which 
are OR composed in Figure 11b and AND
composed  in  Figure  11c. 
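Leaf composition of Boolean decision trees can be sketched in a few lines. The tuple encoding and the two sample trees are assumptions for illustration (they are not the trees of Figure 11): `F` represents a AND b, and `G` represents c. Replacing the leaves labeled 0 yields the OR of the two functions, and replacing the leaves labeled 1 yields their AND.

```python
from itertools import product

def leaf_compose(first, second, label):
    """Attach `second` in place of every leaf of `first` carrying `label`.
    A tree is a leaf value (0 or 1) or (variable, low subtree, high subtree)."""
    if not isinstance(first, tuple):
        return second if first == label else first
    var, low, high = first
    return (var,
            leaf_compose(low, second, label),
            leaf_compose(high, second, label))

def evaluate(tree, env):
    """Evaluate a tree for a dict mapping variable names to 0 or 1."""
    while isinstance(tree, tuple):
        var, low, high = tree
        tree = high if env[var] else low
    return tree

F = ("a", 0, ("b", 0, 1))   # a AND b
G = ("c", 0, 1)             # c

f_or_g = leaf_compose(F, G, 0)    # replace the 0-leaves: OR
g_or_f = leaf_compose(G, F, 0)    # same function, different tree
f_and_g = leaf_compose(F, G, 1)   # replace the 1-leaves: AND

for a, b, c in product((0, 1), repeat=3):
    env = {"a": a, "b": b, "c": c}
    assert evaluate(f_or_g, env) == int((a and b) or c)
    assert evaluate(g_or_f, env) == evaluate(f_or_g, env)
    assert evaluate(f_and_g, env) == int(a and b and c)
```

The two OR compositions compute the same function but are different trees, which is the sense in which the composition is commutative only in terms of the function it yields.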

The  behavior  of  the  various  optimization 
criteria  under  leaf  composition  is  simple. 
The  storage  cost  of  the  composition  is  the 
sum  of  the  cost  of  the  first  tree  and,  for 
each  replaced  leaf,  of  the  cost  of  the  second 
tree.  This  can  be  simplified  for  diagrams; 
since  only  one  leaf  of  each  label  can  exist, 
the  storage  cost  of  the  composed  diagram 
is  just  the  sum  of  the  costs  of  the  compo¬ 
nent  diagrams.  The  expected  testing  cost  of 
the  composition  is  the  sum  of  the  cost  of 
the  first  tree  and  of  the  cost  of  the  second 
tree,  the  latter  being  multiplied  by  the 
probability  that  one  of  the  replaced  leaves 
will  be  reached.  Thus  all  the  techniques 
used  for  the  optimization  of  decision  trees 
are applicable to the optimization of composed
trees (see More80b). In connection
with  the  composition  of  Boolean  decision 
trees  mentioned  above,  Perl76  proved 
that,  on  the  average,  the  order  of  composi¬ 
tion  is  irrelevant  to  the  expected  testing 
cost  of  the  composed  tree.  (Notice,  how¬ 
ever,  that  this  is  clearly  not  the  case  for  the 
OR compositions illustrated in Figure 11b.)

Figure 11. An example of leaf composition: (a) the
two functions; (b) left and right OR compositions; (c)
left and right AND compositions.


Whereas  the  rationale  behind  the  leaf 
composition  was  the  progressive  refinement 
of  function  values,  node  composition  intro¬ 
duces  a  refinement  of  the  values  of  the 
variables.  Thus  the  former  is  associated 
with  an  explicit  tree  hierarchy,  while  the 
latter  induces  a  functional  hierarchy.  In  a 
node composition, an A-ary variable is replaced
by a tree with at least one leaf labeled
with each value from 0 to A - 1; this
tree  specifies  how  to  evaluate  the  variable 
it  replaces.  (Thus  the  software  analog  is  the 
use  of  an  auxiliary  procedure  that  returns 
a  value,  i.e.,  a  subroutine  call.)  The  effect 
of  node  composition  on  functions  expressed 
by  formulas  is  just  the  substitution  in  the 
first  function’s  formula  of  the  second  func¬ 
tion’s  formula  for  each  appearance  of  the 
node  variable.  In  terms  of  trees,  the  com¬ 
posed  tree  is  obtained  by  using  the  follow¬ 
ing  procedure  for  each  occurrence  of  the 
specified node: replace the node by the
second tree and attach the jth subtree of
the  node  to  each  leaf  labeled  j  in  the  second 
tree.  As  for  leaf  composition,  the  trees  used 
in  node  composition  must  not  share  any 
variable.  Figure  12  illustrates  a  node  com¬ 
position.  Aiers  [Aker78]  proposed  this 
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Figure  12.  An  example  of  node  composition:  (a)  the 
trees  for  the  function  (left)  and  the  variable  (right); 
(b)  the  composed  tree. 


composition  and  studied  its  use  in  the  de¬ 
sign  and  analysis  of  logic  functions. 
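The node composition procedure described above can be sketched as follows. The tuple encoding, the function names, and the sample trees are assumptions for illustration (they are not the trees of Figure 12): the first tree computes "y OR a", and the variable y is itself defined by a second tree computing "b AND c".

```python
from itertools import product

def graft(tree, subtrees):
    """Replace each leaf labeled j in `tree` by subtrees[j]."""
    if not isinstance(tree, tuple):
        return subtrees[tree]
    var, *children = tree
    return (var, *[graft(child, subtrees) for child in children])

def node_compose(first, variable, second):
    """Replace every node of `first` testing `variable` by `second`,
    attaching the j-th subtree of the node to each leaf labeled j."""
    if not isinstance(first, tuple):
        return first
    var, *children = first
    children = [node_compose(child, variable, second) for child in children]
    if var != variable:
        return (var, *children)
    return graft(second, children)

def evaluate(tree, env):
    """Evaluate a tree whose nodes are (variable, subtree 0, subtree 1, ...)."""
    while isinstance(tree, tuple):
        tree = tree[1 + env[tree[0]]]
    return tree

FUNCTION = ("y", ("a", 0, 1), 1)    # y OR a
VARIABLE = ("b", 0, ("c", 0, 1))    # y is defined as b AND c

composed = node_compose(FUNCTION, "y", VARIABLE)

for a, b, c in product((0, 1), repeat=3):
    env = {"a": a, "b": b, "c": c}
    assert evaluate(composed, env) == int((b and c) or a)
```

The subtree attached to the leaves labeled 0 of the second tree is copied once per such leaf, which is why the tree costs of the composition depend on how many leaves of the second tree share each label.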

It  is  noted  that  the  behavior  of  the  opti- 
mization  criteria  under  node  composition  is 
somewhat  complex.  While  the  diagram 
storage  cost  of  a  composition  is  simply  the 
sum  of  the  cost  of  the  first  diagram  and,  for 
each replaced node, of the cost of the second
diagram, the other costs depend on
the exact structure of the second tree. In particular,
it is necessary to know how many
leaves of the second tree share the same
label, say label j, since the jth subtree of
the  replaced  node  will  be  attached  to  each 
of  these  leaves.  However,  the  optimization 
methods  described  in  this  article  can  be 
applied  with  suitable  modifications. 

Both  modes  of  composition  can  be  used 
at  once.  Recent  research  on  the  problem  of 
bacteriological  identification  in  a  clinical 
environment  [Shap81]  suggests  as  the  ini¬ 
tial  model  a  user-specified  identification 
tree  that  simply  describes  a  hierarchy  of 
classes  and  subclasses  of  bacteria  (as  deter¬ 
mined  by  local  factors  such  as  common 
mode  of  treatment  in  initial  stages,  like¬ 
lihood  of  occurrence  in  the  geographical 
area, etc.). The hierarchy involves leaf composition,
while the actual tests to be used
for identification are specified by node composition.

In  discussing  both  modes  of  composition, 
we  have  purposefully  avoided  a  delicate 
problem:  What  if  a  tree  is  composed  with 
itself?  In  such  a  case,  the  leaves  or  nodes  of 
the  first  tree  are  replaced  by  the  identical 
tree,  so  that  more  replacements  are  possi¬ 
ble,  and  so  on.  This  creates  an  infinite  re¬ 
cursion,  giving  rise  to  an  infinite  tree  (or  a 
diagram  with  cycles).  Although  composing 
a  tree  with  itself  may  appear  contrived, 
recursion  is  a  sufficiently  fundamental  phe¬ 
nomenon  that  a  study  of  its  effects  on  de¬ 
cision  trees  is  warranted.  Unfortunately, 
very  little  work  has  been  done  in  this  area. 

Moret [More80b] studied the recursion
due  to  a  single  leaf  composition  and  showed 
that  the  concepts  of  expected  testing  cost, 
diagram  storage  cost,  and  activity  can  be 
extended  to  such  simple  recursive  trees. 
Although  the  tree  storage  cost  is  clearly 
infinite, as is the worst-case testing cost,
the  expected  testing  cost  is  generally  finite. 
This stems from the fact that there usually
is a nonzero probability, call it $\varepsilon$, of reaching
a nonreplaceable leaf in the component
tree, so that the probability of recursing
one level, $1 - \varepsilon$, is less than one. Now, the
probability of recursing k levels is just
$(1 - \varepsilon)^k$, so that the average number of
recursion levels used before termination is

$$r_{av} = \sum_{k=0}^{\infty} (1 - \varepsilon)^k = \frac{1}{\varepsilon}.$$

This  factor  can  be  used  for  transforming 
the  expected  testing  cost  of  a  single,  non¬ 
recursive  copy  of  the  tree  into  the  expected 
testing  cost  of  the  recursive  tree;  the  activ¬ 
ities  of  the  variables  can  be  computed  sim¬ 
ilarly. 
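The geometric series behind the average number of recursion levels can be checked numerically. In the sketch below, `eps` stands for the probability of reaching a nonreplaceable leaf; the function names are assumptions, and the last function rests on a simplifying assumption that is not spelled out in the text, namely that every copy of the tree entered contributes the same expected cost.

```python
def average_levels(eps):
    """Closed form of sum_{k>=0} (1 - eps)**k, valid for 0 < eps <= 1."""
    if not 0 < eps <= 1:
        raise ValueError("eps must lie in (0, 1]")
    return 1.0 / eps

def partial_sum(eps, terms):
    """Sum of the first `terms` terms of the series, for comparison."""
    return sum((1 - eps) ** k for k in range(terms))

def recursive_expected_cost(single_copy_cost, eps):
    """Expected testing cost of the recursive tree, ASSUMING each copy
    entered contributes the same expected cost `single_copy_cost`."""
    return single_copy_cost * average_levels(eps)

eps = 0.25
print(average_levels(eps), partial_sum(eps, 1000))
print(recursive_expected_cost(1.5, eps))
```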

5.2  Applications  to  Testing 

Decision  trees  have  long  been  used  for  pur¬ 
poses  of  fault  diagnosis,  as  previously  seen. 
Such  uses,  however,  apply  decision  trees  to 
the  analysis  of  a  discrete  function  which  is 
not  that  which  they  represent.  Akers 
[Aker78,  Aker79]  proposed  that  binary 
decision  diagrams  be  used  as  the  basis  for 
developing  tests  for  the  Boolean  function 
which  they  represent.  Some  of  the  reasons 
given have been discussed above, such as
compactness, good structure, and existence
of composition. Other reasons include the
close relationship between binary decision
diagrams  and  standard  logic  design  and  the 
ease  of  automation  in  handling  diagrams. 

Moret [More81a] advocated the use of
decision  trees  as  alternate  models  for  sys¬ 
tem  analysis.  A  standard  tool  in  the  analysis 
of  system  reliability  is  the  fault  tree 
[Barl75],  which  is  just  a  graphical  repre¬ 
sentation  (using  AND  and  OR  gates)  of  the 
Boolean  function  describing  a  system’s 
state  in  terms  of  the  state  of  its  components 
(where  the  only  possible  values  are 
“working”  and  “failing”).  Fault  trees  are 
used  for  the  assessment  of  the  overall  reli¬ 
ability  of  a  system  and  of  its  sensitivity  to 
the  state  of  various  components.  The  same 
analysis  can  be  performed  using  decision 
trees and activities, with added advantages:
(1)  as  seen,  decision  trees  are  more  efficient 
than  formulas  (which  is  essentially  what 
fault  trees  are);  (2)  decision  trees  can  be 
used  for  multivalued  functions,  whereas 
fault  trees  are  restricted  to  Boolean  func¬ 
tions;  and  (3)  decision  trees  can  include 
recursion  and  general  composition  opera¬ 
tions,  while  fault  trees  are  limited  to  an 
equivalent  of  node  composition.  A  possible 
drawback,  however,  is  that  the  manipula¬ 
tion  of  Boolean  formulas,  while  inherently 
inefficient,  is  well  understood  and  has 
been  successfully  implemented  (e.g.,  see 
Worr75),  whereas  that  of  decision  trees  is 
still  in  its  infancy. 

6.  CONCLUSION 

This article has provided a unified framework of definitions and notation for decision
trees and diagrams; it has examined the problem of optimization, reviewed the main
applications and contributions, and described some recent developments. While often
difficult to optimize, decision trees and diagrams emerge as efficient representations
of discrete functions, of particular interest in pattern recognition, logic design,
programming methodology, and system analysis. Although several research programs are
actively concerned with decision trees, further areas of study have been identified,
notably the quality of optimization heuristics and the use of general recursion.

Abstract

With the advent of large scale integration, testing methods must be developed which rely
solely on the input-output behavior of systems, thereby requiring an
implementation-independent model of system behavior. Such a model, the decision tree,
which has proved of great use in many areas of Computer Science, is briefly presented.
Using this model, a measure of the complexity of multivalued discrete functions is
developed, as well as a measure of the contribution of individual variables to the
overall complexity. The latter concept, the activity of a variable, is shown to have
considerable potential for the design of incomplete testing procedures. In particular,
exercising those variables which have the largest activity maximizes the probability of
error detection in systems with equally likely faults. Finally, activity is shown to be a
powerful tool for the analysis of multivalued fault trees, thereby allowing the
application of some digital testing techniques to analog systems modelled by multivalued
functions.

Introduction


As the size and complexity of new integrated systems increase, the need arises for
methods of analysis, testing, and design which are implementation-independent, using
only input-output specifications. Of particular importance is the ability to evaluate
the complexity of a problem, as well as how individual variables contribute to it, in
order to select an appropriate set of analytical tools and establish guidelines for
testing and design procedures.

One implementation-independent model of discrete function evaluation, the decision tree,
has long been used in Computer Science for establishing lower bounds on the complexity
of problems (e.g., Knuth 71), designing switching circuits (e.g., Cerny 79), or
establishing classification procedures for pattern recognition (Bell 78) and machine
diagnosis (Chang 70). A decision tree is essentially a sequential evaluation procedure,
whereby the value of a variable (test) is determined and the next action (to select
another variable to evaluate or to output the value of the system's function) is chosen
accordingly. In particular, decision trees can be used to determine the state of a
system (Halpern 74); Figure 1a shows a possible decision tree for the state of the
simple system pictured in Figure 1b.

In a previous investigation (Moret, Thomason, and Gonzalez 80, 81), the authors
generalized the decision tree model to include composition and simple recursion, thus
allowing the modelling of hierarchical systems with simple feedback. They also developed
a new complexity measure for discrete functions, the intrinsic cost, as well as a
measure of the contribution of a variable to the complexity of the function, the
activity, and showed the close relationship existing between these measures and the
decision tree model. As detailed below, the concept of activity shows considerable
potential as a tool for system testing.

Activity  and  incomplete  testing 

In an input-output system, a failure is characterized by a deviation from the expected
output signal (that is, a different value for discrete systems and a value out of
tolerance for analog systems). This approach is known as signal reliability (Koren 79),
in contrast with the conventional functional reliability, which considers all internal
(and possibly non-critical) system faults. Signal reliability is thus more accurate and
better suited to large integrated systems.

The thorough testing of a system can only be done by exhaustion; such an approach,
however, is infeasible for all but the simplest systems. Thus one is forced to use some
method of incomplete testing. In the case of combinational (i.e., memoryless) discrete
circuits, Losq (Losq 78) has shown that random compact testing, a method which applies a
sequence of random input vectors to a system and compares some output statistics with
those gathered from a perfect ("gold") unit, can yield very reliable estimates at only a
small fraction of the cost of exhaustive testing. Often, however, such a method is
inapplicable, because not all inputs are controllable; in a system with memory
(feedback), for instance, the values of the feedback variables often cannot be either
examined or modified (as illustrated in Figure 2).

When only a fraction of the variables is accessible or when only the most "important"
variables must be tested, exhaustive testing can be used with a selected subset of
variables. Such a subset must be chosen such that the probability of detecting a
malfunction is maximized. The authors have shown that when all malfunctions are equally
likely, such a subset must consist of those variables which have the largest activity
(Moret 80). Thus the activity of a variable measures, in some sense, how important a
variable is to the correct functioning of a system.

Activity and fault trees

A complex system is rarely specified as a whole, but is conceived as a structure of
simpler subsystems which interact by communicating the values of variables. Reliability
analysis is then carried out on the structural relationships by representing each
subsystem by a single variable qualifying its operating state; when such variables are
binary, taking the values "works" or "fails", this leads to the fault tree model, which
has found widespread use in industry (Fussell 79, Reactor Safety Study


A fault tree is essentially a Boolean function describing the set of conditions (on the
subsystems) necessary to make a complete system fail. Figure 3 shows a possible fault
tree for the system of Figure 1b. Obviously, each subsystem can in turn be decomposed
and modelled in the same way. Fault trees are used to determine the probability of
failure of a system as well as for the study of the role of individual subsystems. A
tool commonly used for the latter purpose is the Boolean differential calculus (Bennetts
75, Thomason and Page 76). The Boolean difference of a Boolean function, $f$, with
respect to one of its variables, $x$, is the function:

$$df/dx = f|_{x=0} \oplus f|_{x=1}$$

where $\oplus$ denotes summation modulo 2. It is well known that $df/dx = 1$ exactly
when $f$ critically depends on $x$, so that the probability that a system represented by
$f$ fails due to the failure of subsystem $x$ is

$$\mathrm{prob}(df/dx = 1) \cdot \mathrm{prob}(x \text{ fails}).$$

The Boolean difference is closely related to the activity of a variable (Moret,
Thomason, and Gonzalez 80): when all probabilities are equal, the activity of variable
$x$ reduces to $\mathrm{prob}(df/dx = 1)$.
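
The Boolean difference and the probability that it equals 1 can be computed by brute force for small functions. The sketch below assumes a uniform distribution over the inputs and represents a Boolean function as a Python callable on a tuple of 0/1 values; the function names and the sample function are illustrative assumptions, and the code covers only the equal-probability Boolean case, not the general definition of activity.

```python
from itertools import product


def boolean_difference(f, i):
    """Return df/dx_i as a callable: f with x_i = 0 XOR f with x_i = 1."""
    def df(x):
        x0 = x[:i] + (0,) + x[i + 1:]
        x1 = x[:i] + (1,) + x[i + 1:]
        return f(x0) ^ f(x1)
    return df


def prob_difference_is_one(f, n, i):
    """Fraction of the 2**n input points at which df/dx_i = 1.

    Assumes every input combination is equally likely.
    """
    df = boolean_difference(f, i)
    points = list(product((0, 1), repeat=n))
    return sum(df(x) for x in points) / len(points)


def sample(x):
    """Hypothetical system function: x0 AND (x1 OR x2)."""
    return x[0] & (x[1] | x[2])


if __name__ == "__main__":
    for i in range(3):
        print(i, prob_difference_is_one(sample, 3, i))
```

For the sample function, the output depends critically on x0 whenever x1 or x2 is 1, which is why x0 scores higher than the other two variables.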

Thus the activity of a variable is a natural extension to Boolean difference analysis;
unlike the latter, it is applicable to arbitrary multivalued functions (as opposed to
multi-valued logic (Bell, Page, and Thomason 72)), which makes it the tool of choice for
the analysis of multivalued fault trees.


Applications to Analog Systems


In analog systems, it is often difficult to decompose a system so that its components
can be characterized as either perfect or faulty. Rather, the observed output signal
deviates in some measure from the ideal output. Small deviations are potentially
acceptable and large ones probably not, but there is an intermediate zone in which a
signal can be just above tolerance without falling in either category. This situation is
summarized in Figure 4.

It is conceivable that a cascade of two subsystems, each of which is above tolerance but
not faulty, results in a faulty system. In order to model this situation, a subsystem
must be described not by a binary variable, but by a multivalued variable, taking for
instance the values "fault", "below tolerance", "above tolerance", and "perfect." Then
the failure function is not a Boolean function but a general discrete function; fault
tree models must be generalized and Boolean calculus is no longer applicable, so that
activity becomes the main tool for analysis.

Conclusion

The activity of a variable, a new concept which measures the contribution of a variable
to the (testing) complexity of a discrete function, has been introduced. It has been
shown to be of great potential as a tool for the analysis and the testing of both
discrete and analog systems.

Figure 1. A possible decision tree (a) for a simple system (b).

Figure 2. A system with memory (feedback) showing inaccessible internal variables.

Figure 3. A fault tree for the system of Figure 1b.

Figure 4. Subdivisions of the range of an analog signal.


OPTIMIZATION  CRITERIA  FOR  DECISION  TREES 

ABSTRACT

Decision trees are a model of the sequential evaluation of discrete functions that have
widespread applications in pattern recognition, taxonomy, decision table programming,
databases, switching theory, and concrete complexity theory. Since a function in general
has numerous decision tree representations, it is necessary to adopt some selection
criterion in order to obtain the most appropriate representation. Several such
optimization criteria have been proposed in the literature, but few have been studied
together and the choice of a criterion has not often been directly addressed.

This paper regroups those criteria into a common, generalized framework, and examines
their interrelationships. It is shown that, even in the simplest cases, most criteria
cannot be optimized simultaneously, thereby disproving some conjectures found in the
literature. Two new results are presented concerning the worst-case number of argument
evaluations for Boolean functions. On the basis of the accumulated results, it is argued
that two optimization criteria have widespread relevance; the computational complexity
of these criteria is examined in detail.

1. Introduction


A decision tree is a model of the evaluation of a discrete function, wherein the value
of a variable is determined and the next action (to choose another variable to evaluate
or to output the value of the function) is chosen accordingly. Decision trees have many
applications in pattern recognition [20,25], taxonomy and identification [5,6,11],
decision table programming [1,15,16,21,22,24,28,28], switching theory [2,3,4,13], and
analysis of algorithms [23,27]. More recently, they have been proposed as
implementation-independent models of discrete functions with a view to the development
of new complexity measures [18].

Since variables can be tested in arbitrary order during the sequential evaluation
procedure, a given discrete function has, in general, numerous decision tree
representations. Thus, it is necessary to develop some criterion for the selection of an
appropriate tree, that is, to develop some measure on decision trees. Several measures
have been proposed in the literature [8,13,21,22], and the multiplicity of criteria
presents the user with a problem of choice.

This article discusses several of these measures within a common framework of
definitions and notation. After providing a formal definition of decision trees and
expressing the various proposed criteria in the established framework, we briefly review
the published optimization algorithms to place the optimization problem in perspective.
The relationships between measures are then studied, beginning with the simplified case
of Boolean functions (the case that is least conducive to incompatibilities). It is
shown that, even in this case, almost all criteria are pairwise incompatible, that is,
they cannot be simultaneously optimized for all functions. This disproves some
conjectures found in the literature [2,28]. The special case of binary identification is
then separately examined. Finally, we discuss each criterion in turn and examine its
computational complexity. Several criteria are found to have limited applicability due
to their specific behavior; in particular, we extend a result of Rivest [23] by showing
that all symmetric and all linearly separable Boolean functions are exhaustive, i.e.,
have maximal worst-case testing cost. We conclude by suggesting that two specific
measures, related to run-time cost and retention cost of trees, are the most generally
useful optimization criteria at this time.

2.  Preliminaries 

The  following  formal  definition  of  a  decision  tree  appears  in  [18,19]. 

Definition 1. Let $f$ be a (partial) function of discrete variables, $x_1, \ldots, x_n$,
where variable $x_i$ takes on $m_i$ values (denoted $0, \ldots, m_i - 1$). If $f$ is a
constant, then the decision tree for $f$ consists of a single leaf labelled by that
constant; otherwise, for each $x_i$, $1 \le i \le n$, $f$ has decision tree(s) composed
of a root labelled $x_i$ and $m_i$ subtrees corresponding to the restrictions (hereafter
called subfunctions) $f|_{x_i=0}, \ldots, f|_{x_i=m_i-1}$, in that order. □

It is noted that the same subtree may occur on several branches of the tree, in which
case it may be desirable to use only one copy of that subtree by transforming the
decision tree into a decision diagram with a rooted directed acyclic graph structure. To
every decision diagram there corresponds a unique decision tree with a one-to-one
correspondence between its paths and those in the tree.
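
The recursive definition and the sharing of identical subtrees can be sketched in a few lines. The code below is an illustration only: it assumes a completely specified function given as a Python callable, a fixed testing order chosen by the caller, and a nested-tuple representation of trees; none of these choices comes from the paper.

```python
from itertools import product


def build_tree(f, arities, order, assignment=None):
    """Build one decision tree for f in the sense of Definition 1.

    arities[i] is the number of values of variable i; `order` lists the
    variables in the order in which they are tested (an assumption of this
    sketch). A leaf is ("leaf", value); an internal node is
    (variable_index, (subtree_0, ..., subtree_{m-1})).
    """
    assignment = dict(assignment or {})
    n = len(arities)
    free = [i for i in range(n) if i not in assignment]
    values = set()
    for combo in product(*(range(arities[i]) for i in free)):
        full = dict(assignment)
        full.update(zip(free, combo))
        values.add(f(tuple(full[i] for i in range(n))))
    if len(values) == 1:          # the restriction is a constant: emit a leaf
        return ("leaf", values.pop())
    var = next(i for i in order if i not in assignment)
    children = tuple(
        build_tree(f, arities, order, {**assignment, var: v})
        for v in range(arities[var])
    )
    return (var, children)


def internal_nodes(tree):
    """Number of test nodes in the tree."""
    if tree[0] == "leaf":
        return 0
    return 1 + sum(internal_nodes(child) for child in tree[1])


def diagram_nodes(tree):
    """Number of test nodes once identical subtrees are merged."""
    seen = set()

    def visit(t):
        if t[0] == "leaf" or t in seen:
            return
        seen.add(t)
        for child in t[1]:
            visit(child)

    visit(tree)
    return len(seen)
```

For the parity of three binary variables tested in the order x0, x1, x2, the tree has 7 test nodes while the diagram needs only 5, because the two distinct x2 subtrees are each shared.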

Two costs are usually associated with each variable of a function: a testing cost
measures the expense (in time or any resource associated with evaluation of that
variable) incurred each time that variable is evaluated; and a storage cost measures the
expense (in storage or any resource associated with the presence of that test) due to
the presence of each test node labelled by that variable. In addition, a probability
distribution is often specified on the variables' space and can be assumed uniform if
not otherwise known. These data allow the computation of the following six measures.

Definition  2. 

i) The total testing cost, $\eta$, is the sum, taken over all the paths from the root to
the leaves, of the path testing costs, where the testing cost of a path is the sum of
the testing costs of the variables evaluated on that path. When all testing costs are
unity, $\eta$ reduces to the external path length [12], itself a special case of the
tree path entropy defined in [8].

ii)  The  normalized  testing  cost,  H,  is  the  total  testing  cost  divided  by  the  number  of 
paths.  When  all  testing  costs  are  unity,  H  reduces  to  the  average  path  length, 
itself  a  special  case  of  the  normalized  tree  path  entropy  [8]. 

iii)  The  worst-case  testing  cost,  h,  is  the  maximum  path  testing  cost.  When  all  testing 
costs  are  unity,  h  reduces  to  the  worst-case  number  of  tests,  that  is,  the  height  of 
the  tree  or  diagram. 

iv)  The  expected  testing  cost,  E,  is  the  expected  value  of  the  path  testing  cost,  where 
the  probability  of  a  path  is  the  sum  of  the  probabilities  of  all  the  combinations  of 
variables’  values  that  select  that  path. 

v) The tree storage cost, $\alpha$, is the sum, taken over all the internal nodes of the
tree, of the storage costs of the associated variables. When all storage costs are
unity, $\alpha$ is the total number of internal nodes of the tree.

vi) The diagram storage cost, $\beta$, is the same sum as in (v) taken over all the
internal nodes of the diagram. □

It is noted that, in the case of unity testing and storage costs and uniform probability
distribution, the only datum needed to compute the first five measures is the number of
leaves at each level of the tree. Thus, a decision tree for a function of $n$ variables
can be entirely characterized by an $(n+1)$-tuple, $(X_0, \ldots, X_n)$, where $X_i$ is
the number of leaves at level $i$. This notation is called the leaf profile [18] by
analogy with a similar notation defined in [17]. The leaf profile induces a
lexicographic ordering of decision trees, which in turn gives rise to two additional
measures: the maximum profile, which ranks as best that tree which is largest in
lexicographic order (on the grounds that leaves should be encountered as soon as
possible); and the minimum reverse profile, which ranks as best that tree which is
smallest in reverse lexicographic order (on the grounds that the number of long paths
should be minimized).
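
The two profile-based criteria amount to tuple comparisons. The sketch below assumes that "reverse lexicographic order" means comparing profiles from the deepest level upward; the function names are illustrative, and the three sample profiles are the ones listed in Table 1 of this paper.

```python
def max_profile(profiles):
    """Profile that is largest in lexicographic order (leaves as early as possible)."""
    return max(profiles)


def min_reverse_profile(profiles):
    """Profile that is smallest when compared from the last component backward.

    This reading of "reverse lexicographic order" is an assumption of the sketch.
    """
    return min(profiles, key=lambda p: tuple(reversed(p)))


TABLE_1_PROFILES = [
    (0, 0, 0, 8, 0),
    (0, 0, 1, 4, 4),
    (0, 0, 2, 2, 4),
]

if __name__ == "__main__":
    print(max_profile(TABLE_1_PROFILES))
    print(min_reverse_profile(TABLE_1_PROFILES))
```

On these three profiles the two criteria select different trees, which is the kind of disagreement examined in Section 4.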

As  an  example  of  the  above  concepts,  consider  the  Boolean  function  of  four  variables 
given  by  the  formula, 

/(*  I»*2>*3>*4)  =  *1*2  +  *1*4  +  *2*3- 

Figure 1 shows a decision diagram and its corresponding decision tree for $f$; the
various measures are as follows:

external path length, $\eta$ = 17;
average path length, $H$ = 2.83;
expected number of tests, $E$ = 2.375;
tree node count, $\alpha$ = 5;
diagram node count, $\beta$ = 4;
leaf profile = (0,0,3,1,2).
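
Since the first five measures depend only on the leaf profile in the unit-cost, uniform case, the values listed above can be recomputed from the profile alone. The following sketch assumes binary variables (so a leaf at level i is reached with probability 2^-i and a tree with L leaves has L - 1 internal nodes); the function name and the dictionary keys are illustrative.

```python
from fractions import Fraction


def measures_from_profile(profile):
    """Compute eta, H, h, E, alpha from a leaf profile (X_0, ..., X_n).

    Assumes Boolean variables, unit testing and storage costs, and a
    uniform distribution on the inputs.
    """
    leaves = sum(profile)
    eta = sum(level * count for level, count in enumerate(profile))
    return {
        "eta": eta,                                   # external path length
        "H": Fraction(eta, leaves),                   # average path length
        "h": max(l for l, c in enumerate(profile) if c),  # height
        "E": sum(Fraction(l * c, 2 ** l) for l, c in enumerate(profile)),
        "alpha": leaves - 1,                          # internal nodes of a binary tree
    }


if __name__ == "__main__":
    for name, value in measures_from_profile((0, 0, 3, 1, 2)).items():
        print(name, float(value))
```

The diagram node count is not recoverable this way, since it depends on which subtrees coincide and not only on where the leaves are.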

In [18,19], decision trees and diagrams are extended to include composition and
recursiveness, and it is shown that the measures defined above can be applied to this
generalized case.

An interesting application of discrete functions is that of binary identification. As
defined in [6], an identification problem consists of a set of objects, a set of binary
questions, and an injective map from the set of objects to the power set of the set of
questions; the image of an object is then the unique combination of positively answered
questions which identifies that object. In the context of this paper, the questions are
binary variables and the objects are values of a bijective partial function from the
variables' space to the set of objects. It is readily seen that all decision trees for
such a function have exactly one leaf for each object and thus all have the same number
of nodes. Moreover, the fact that no two leaves have the same label means that there
cannot exist common subtrees, so that all decision diagrams are in fact trees and, for
each diagram, $\beta = \alpha$. By reason of these and other peculiarities, the case of
binary identification will be treated independently in Section 5.
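
A small identification problem can be explored exhaustively. The sketch below represents each question by the set of objects that answer it positively and computes, by recursion over groups of still-undistinguished objects, the minimum external path length and the minimum height over all identification trees. The data are the five-object problem discussed in Section 5; the function name and representation are assumptions made for illustration, and tests that do not split a group are simply skipped.

```python
from functools import lru_cache

# Five objects and four tests, as in the example of Section 5.
TESTS = {
    "T1": frozenset("a"),
    "T2": frozenset("ab"),
    "T3": frozenset("abc"),
    "T4": frozenset("abcd"),
}


def optimal_identification(objects, tests):
    """Return (min external path length, min height) over identification trees."""
    test_sets = [frozenset(t) for t in tests.values()]

    @lru_cache(maxsize=None)
    def solve(group):
        if len(group) <= 1:
            return (0, 0)                     # a single object is a leaf
        best_len = best_height = None
        for yes in test_sets:
            pos, neg = group & yes, group - yes
            if not pos or not neg:
                continue                      # test does not split this group
            len_pos, h_pos = solve(pos)
            len_neg, h_neg = solve(neg)
            # every object in the group sits one level deeper than in its subtree
            length = len(group) + len_pos + len_neg
            height = 1 + max(h_pos, h_neg)
            best_len = length if best_len is None else min(best_len, length)
            best_height = height if best_height is None else min(best_height, height)
        if best_len is None:
            raise ValueError("objects cannot be distinguished by the given tests")
        return (best_len, best_height)

    return solve(frozenset(objects))


if __name__ == "__main__":
    print(optimal_identification("abcde", TESTS))
```

Both quantities are minimized independently here, so the two numbers returned need not be achieved by the same tree.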

3. The Construction Of Optimal Decision Trees And Diagrams

The problem of constructing decision trees and diagrams that are optimal with respect to
the various criteria has been addressed by numerous researchers using branch-and-bound
techniques, dynamic programming, and various heuristics. A survey of their efforts can
be found in [18].

Dynamic programming is of particular interest because several measures obey the
"principle of optimality," that is, they have the property that an optimal solution can
be built from optimal subsolutions. Indeed, if variable $x_i$, with testing cost $t_i$
and storage cost $s_i$, is tested at the root of a decision tree for the function
$f(x_1, \ldots, x_n)$, then the optimal values for such a tree for three of the measures
are

$$h_{\min}(f) = t_i + \max\{h_{\min}(f|_{x_i=0}), \ldots, h_{\min}(f|_{x_i=m_i-1})\},$$

$$E_{\min}(f) = t_i + \sum_{j=0}^{m_i-1} p(x_i=j)\, E_{\min}(f|_{x_i=j}),$$

$$\alpha_{\min}(f) = s_i + \sum_{j=0}^{m_i-1} \alpha_{\min}(f|_{x_i=j}),$$

where $p(x_i=j)$ denotes the probability that $x_i$ takes on the value $j$. Similarly,
the leaf profile of this tree is the sum, component by component, of the leaf profiles
of its subtrees. Hence, five out of the eight proposed measures obey the principle of
optimality.


This approach is embodied in an algorithm designed to convert limited-entry decision
tables into decision trees with minimal expected testing cost [1], later rediscovered
[24] and refined [15]; a closely related procedure appears in [20]. This algorithm is
easily adapted to any of the five possible criteria and to the most general type of
decision tree [18]. For a function of $n$ $k$-ary variables, the algorithm requires
$O(n(k+1)^{n-1})$ steps; since a complete specification of the function necessitates an
input of size $s = O(k^n)$, the time complexity is $O(s^{\log_k(k+1)} \log s)$. Dynamic
programming thus offers an efficient solution to the optimization problem for those
measures in the case of completely specified functions.
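
The dynamic programming idea can be illustrated for the expected testing cost. The sketch below is not the published algorithm of [1]; it is a memoized recursion over subfunctions, assuming a completely specified Boolean function given as a Python callable, unit testing costs, and a uniform input distribution. A subfunction is identified by a partial assignment in which each variable is 0, 1, or unassigned (`None`), so at most 3^n subproblems are solved, in line with the (k+1)^n count for k = 2.

```python
from functools import lru_cache
from itertools import product


def min_expected_tests(f, n):
    """Minimal expected number of tests over all decision trees for f."""

    @lru_cache(maxsize=None)
    def best(partial):
        free = [i for i, v in enumerate(partial) if v is None]
        values = set()
        for combo in product((0, 1), repeat=len(free)):
            x = list(partial)
            for i, v in zip(free, combo):
                x[i] = v
            values.add(f(tuple(x)))
        if len(values) == 1:
            return 0.0                      # constant subfunction: a leaf
        # principle of optimality: best root test plus optimal subtrees
        return min(
            1.0 + 0.5 * sum(
                best(partial[:i] + (v,) + partial[i + 1:]) for v in (0, 1)
            )
            for i in free
        )

    return best((None,) * n)


if __name__ == "__main__":
    majority = lambda x: int(sum(x) >= 2)
    print(min_expected_tests(majority, 3))
```

Replacing the weighted sum by a maximum gives the worst-case cost, and replacing it by an unweighted sum gives the tree node count, following the same recursive structure.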


In the case of binary identification (and, more generally, of partial functions),
however, a similar dynamic programming algorithm [6] is of exponential complexity
because the specification of the problem is very much shorter than for complete
functions, resulting in a much smaller input. Indeed, it has been proved [10,14] that
the problem of constructing binary identification trees with minimal expected testing
cost is NP-hard.


The remaining three measures, $\beta$, $\eta$, and $H$, are not easily optimized.
Branch-and-bound techniques, used for the optimization of $E$ [3,21], have also been
applied to the minimization of storage cost [22] for both trees and diagrams; however,
such procedures are of exponential complexity. Little work appears to have been done on
the optimization of $\eta$ and $H$.


4. Compatibility Between Optimization Criteria

In this section, attention is focused on relationships between existing optimization
criteria for the construction of decision trees.


4.1  Definitions 


Given a function, $f$, and an optimization criterion, $\omega$, let $T_f^\omega$ denote
the set of all tree representations for $f$ which optimize $\omega$.


Definition 3. Let $F$ be a class of functions and $\psi$, $\omega$ two optimization
criteria. Then we say that, for that class of functions:

i) $\psi$ and $\omega$ are equivalent, denoted $\psi \Leftrightarrow \omega$, if
$\forall f \in F$, $T_f^\psi = T_f^\omega$.

ii) $\psi$ is a special case of $\omega$, denoted $\psi \Leftarrow \omega$, if
$\forall f \in F$, $T_f^\psi \supseteq T_f^\omega$.

iii) $\psi$ and $\omega$ are compatible if $\forall f \in F$,
$T_f^\psi \cap T_f^\omega \neq \emptyset$. □

Moreover, we shall say that $\psi$ and $\omega$ are strictly equivalent if they are
equivalent and the ordering of all the tree representations for any function in $F$ is
the same under both criteria.
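
The three relations reduce to set comparisons once the optimal trees are known. The sketch below assumes that, for each criterion, a dictionary maps every function of the class to the set of its optimal trees (identified here by arbitrary labels); it also assumes that "special case" means the optimal trees of the second criterion are contained in those of the first, which is one reading of the definition. All names and data are hypothetical.

```python
def equivalent(opt_psi, opt_omega):
    """True if both criteria have exactly the same optimal trees for every function."""
    return all(opt_psi[f] == opt_omega[f] for f in opt_psi)


def special_case(opt_psi, opt_omega):
    """True if every omega-optimal tree is also psi-optimal, for every function.

    The direction of the containment is an assumption of this sketch.
    """
    return all(opt_psi[f] >= opt_omega[f] for f in opt_psi)


def compatible(opt_psi, opt_omega):
    """True if, for every function, some tree is optimal under both criteria."""
    return all(opt_psi[f] & opt_omega[f] for f in opt_psi)


if __name__ == "__main__":
    # Hypothetical optimal-tree sets for a one-function class.
    opt_height = {"f": {"t1", "t2", "t3"}}
    opt_reverse = {"f": {"t1"}}
    opt_maxprof = {"f": {"t4"}}
    print(special_case(opt_height, opt_reverse))   # containment holds
    print(compatible(opt_height, opt_maxprof))     # no common optimal tree
```

A single function with disjoint optimal sets is enough to establish incompatibility, which is how the counterexamples in this section are used.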

It  can  now  be  shown  that,  in  almost  all  cases,  criteria  are  pairwise  incompatible  even 
in  severely  restricted  classes  of  functions. 

4.2 Results

We first examine the case that is least conducive to incompatibilities, namely that of
completely specified Boolean functions with uniform probability distribution and unity
costs. Figure 2 summarizes the findings for that class of functions (a 0 entry means
that the respective criteria are incompatible and a blank entry indicates that the exact
relationship is unknown). Thus, most criteria are pairwise incompatible. In particular,
$\alpha$ is not a special case of $E$; this can be seen by examining the trees for the
Boolean function of four variables given by the formula

/(* 1»*2>*3»*4)  “  *l*2+a:lzJ+*2a:3*4i 
and  thereby  disproves  conjectures  found  in  [2,  p.115  and  28,  p.104]. 

We proceed to prove some of the relationships; other relationships appearing in
Figure 2, but not explicitly proved in the text, are easily established by similarly
constructed counterexamples.

Proposition  1.  The  maximum  profile  is  incompatible  with  any  other  measure. 

Proof: Two counterexamples will be used. First, let $f_a$ be the Boolean function of
four variables given by the formula

/-(*  —  *1*3 +*1*2*4 +  *2*3*4- 

The trees with maximum profile have as the first test either $x_1$ or $x_3$ and also
have minimal expected testing cost; the optimal tree for all other measures, however,
tests $x_4$ first and is unique (except for the minimum diagram storage cost, which can
also be attained by testing $x_1$ or $x_3$ first, but with a structure different from
that of the maximum profile trees). The various measures for the three types of trees
are listed in Table 1. The maximum profile (and, incidentally, the minimum expected
testing cost) is thus incompatible with the minimum reverse profile, the diagram storage
cost, and the total, normalized, and worst-case testing costs. Secondly, let $f_b$ be
the Boolean function of five variables given by the formula

/*(*„* «, *3, *4, *5)  =  *1*5  +  *1*2*5  +  *2*5 '(*3*4  +  *3*4)  +  (*2  +  *5) ’(*3*4  +  *3*4)- 

The trees with the maximum profile test $x_5$ first, while those optimal with respect to
all other measures test $x_1$ first, with the results shown in Table 2. Hence the
maximum profile is also incompatible with the minimum tree storage and expected testing
costs. □

The function $f_a$ above also shows that minimizing the tree or diagram storage costs
does not optimize any other criterion, while $f_b$ yields the same conclusion for the
minimum worst-case testing cost.

Table 1. First counterexample for Proposition 1.

First test        leaf profile    $\alpha$   $\beta$   $\eta$   $H$     $h$   $E$
$x_4$             (0,0,0,8,0)     7          6         24       3       3     3
$x_1$ or $x_3$    (0,0,1,4,4)     8          6         30       3.3     4     3
$x_1$ or $x_3$    (0,0,2,2,4)     7          7         26       3.25    4     2.75

Proposition 2. The normalized testing cost is incompatible with any other measure
(except, possibly, the worst-case testing cost); moreover, minimizing the normalized
testing cost may involve the introduction of redundant tests.

Proof: Let $f_c$ be the Boolean function of five variables given by the formula

fe  (*i,*2.*3»*4»*5)  =  *1*2  +  *2®r3®r4®r5. 

where $\oplus$ stands for summation modulo 2. The optimal trees for all measures except
$H$ test $x_1$ or $x_2$ first and use no redundant test, while the trees with minimum
normalized testing cost may test any variable first and, in case $x_1$ or $x_2$ is
chosen, use a redundant test. Two diagrams rooted in $x_1$ are shown in Figure 3, the
left being optimal with
respect  to  all  criteria  but  H,  and  the  right  being  optimal  for  H;  the  corresponding  meas¬ 
ures  are  listed  in  Table  3.  It  is  noted  that  the  test  of  x&  as  the  right  child  of  the  root  is 
totally  redundant.  Q 

It is conjectured that, among the relationships with unknown status, several
implications hold, most particularly $\eta \Rightarrow \alpha$. Clearly, however, the
introduction of non-uniform probability distributions or non-unity costs renders all
measures pairwise incompatible.

Before discussing the class of all discrete functions with arbitrary probability
distributions and costs, we note the results for the class of partial Boolean functions
with unity costs and uniform probability distribution on the domain, which are shown in
Figure 4. Except for the obvious case of tree height and minimum reverse profile, no
measure is a special case of any other; in fact, almost all measures are pairwise
incompatible. The three counterexamples used to establish results beyond those of
Figure 2 are omitted here for the sake of conciseness.

In  the  general  case,  it  is  easily  seen  that  all  measures  are  pairwise  incompatible.  In 
other  words,  one  must  face  the  problem  of  the  choice  of  a  criterion  since,  even  for  the 
simplest types of functions, it is not generally possible to optimize two criteria
simultaneously.
Table 2. Second counterexample for Proposition 1.

Root   Leaf profile      α_min   β_min   η_min   H_min   h_min   E_min
x_j    (0,0,1,1,8,4)      13       8      57     4.07      5     3.5
x_5    (0,0,1,2,2,12)     16       9      76     4.47      5     3.625


5.  The  Case  Of  Binary  Identification 

The various applicable measures will be examined first under the assumption of unity
costs and uniform probability distribution of objects. Under these conditions, the
storage cost of a tree reduces to its number of nodes (which is fixed, as noted in Section
2), and its expected testing cost is equal to its normalized testing cost, of which the path
length is a fixed multiple. Hence it follows that only four criteria are applicable,
namely, the height, the external path length, and the minimum reverse and maximum
leaf profiles. The known relationships between the four measures are summarized in
Figure 5 and can be established with a single counterexample as follows. Consider the
identification problem with five objects, {a,b,c,d,e}, and four tests, T1={a}, T2={a,b},
T3={a,b,c}, and T4={a,b,c,d}. The trees with maximum profile test T1 or T4 first;
those with minimum reverse profile and minimum path length test T2 or T3 first; and
those with minimum height use any test first (but with different results if the chosen
test is T1 or T4). The resulting measures are listed in Table 4. The exact relationship
between the reverse profile and the path length criteria is not known; it is a simple
matter, however, to construct an example which shows that they are not strictly
equivalent. The introduction of non-uniform probabilities results in further
incompatibilities and one more measure, the expected testing cost. In fact, the only two
measures that are not incompatible are, trivially, the reverse profile and the height.
Storage and testing costs impair the usefulness of leaf profiles (which do not reflect such
data), but give rise to another criterion, the storage cost; all measures are then pairwise
incompatible.
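
The five-object counterexample is small enough to check by exhaustive search. The
following Python sketch is an illustration only: the names OBJECTS, TESTS, split, best,
and rooted are assumptions introduced here and do not appear in the report. For each
choice of first test it computes the minimum external path length and the minimum
height over all identification trees, assuming unity costs.

```python
from functools import lru_cache

# The identification problem from the text: five objects, four nested tests.
OBJECTS = frozenset("abcde")
TESTS = {
    "T1": frozenset("a"),
    "T2": frozenset("ab"),
    "T3": frozenset("abc"),
    "T4": frozenset("abcd"),
}

def split(objs, test):
    """Partition objs into (objects passing the test, objects failing it)."""
    inside = objs & TESTS[test]
    return inside, objs - inside

@lru_cache(maxsize=None)
def best(objs):
    """(minimum external path length, minimum height) over all trees for objs.

    The two minima are taken independently and may come from different trees.
    """
    if len(objs) <= 1:
        return 0, 0
    # Only tests that separate objs into two non-empty parts are useful.
    costs = [rooted(objs, t) for t in TESTS if all(split(objs, t))]
    return min(c[0] for c in costs), min(c[1] for c in costs)

def rooted(objs, test):
    """Same two minima, restricted to trees whose root applies `test`."""
    yes, no = split(objs, test)
    (p_yes, h_yes), (p_no, h_no) = best(yes), best(no)
    # Every leaf below the root is one level deeper, hence the + len(objs).
    return p_yes + p_no + len(objs), 1 + max(h_yes, h_no)

for name in TESTS:
    print(name, rooted(OBJECTS, name))
# T1 (13, 3)
# T2 (12, 3)
# T3 (12, 3)
# T4 (13, 3)
```

Under these assumptions the search agrees with the text: the minimum path length, 12,
requires T2 or T3 at the root, while height 3 is attainable from any root.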

Even  with  unity  costs  and  uniform  distribution,  the  decision  problem  for  the  path 
length measure is known to be NP-complete [10]. The construction in [10] is a
straightforward reduction from the exact cover by three sets (cf. [7, p.53]) and can be used to
show  that  the  decision  problems  for  the  reverse  profile  and  the  worst-case  testing  costs 
are  also  NP-complete.  Finally,  the  decision  problem  for  the  maximum  profile  is  clearly 
in  NP,  but  it  is  not  known  to  be  NP-complete.  Using  standard  extension  and  search 
techniques  as  developed  in  [7],  one  can  show  that  the  optimization  problems  for  the 
storage  cost  and  the  total,  expected,  and  worst-case  testing  costs  are  all  NP-equivalent. 
The  optimization  problems  for  the  profiles  are  both  NP-easy  since,  although  no 
polynomial-time  algorithm  is  known  for  ranking  profiles,  one  can  simply  use  successive 
binary searches (one for each tree level) in order to establish the optimal profile. (In
such a process, only the number of nodes at the searched level is important, so that one
may  set  arbitrary  values  at  the  unknown  levels.)  Table  5  summarizes  the  known  results 
about  the  complexity  of  decision  tree  optimization  in  binary  identification  problems. 


Table 3. Counterexample for Proposition 2.

Tree    Leaf profile      α_min   β_min   η_min   H_min   h_min   E_min
left    (0,0,2,0,0,16)     17       8      84     4.67      5     3.5
right   (0,0,0,4,0,16)     19       9      92     4.6       5     4


6.  An  Assessment  Of  Optimization  Criteria

Since  the  optimization  problem  for  binary  identification  is  NP-hard  for  most  criteria,  it 
follows  that  the  general  problem  of  optimization  for  (partial)  functions  is  also  NP-hard. 
However,  there  are  large  classes  of  functions  for  which  the  optimization  problem  is  well- 
solved  by  the  dynamic  programming  algorithm  mentioned  in  Section  3,  namely  those 
functions,  the  specification  of  which  requires  an  input  of  length  exponential  in  the 
number  of  variables.  Table  6  shows  the  complexity  of  optimization  of  each  criterion  in 
both cases of exponential- and polynomial-length inputs. It is noted that the optimization
of normalized and expected testing costs under polynomial-length inputs is not
known to be NP-easy: the arbitrary probability distribution and number of test outcomes
enormously increase the number of possible values, to the point where even a
binary  search  requires  exponential  time. 

The difficulty of optimizing the normalized testing cost and its erratic behavior (as
exemplified in Proposition 2, where the addition of a redundant test actually lowers the
normalized testing cost) make it an undesirable criterion. Both leaf profile criteria lack
generality, in that they cannot easily take into account arbitrary costs or probability
distributions; therefore, they too are inappropriate measures, except in special
situations. Finally, the tree storage cost is not an accurate reflection of actual memory or
hardware requirements, because the diagram storage cost is never larger and often much
smaller. For instance, a modulo 2 sum of $n$ binary variables requires $O(2^n)$ tree nodes,
but only $O(n)$ diagram nodes. The diagram storage cost is a more relevant measure of
implementation problems.
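
The gap between tree and diagram storage for the modulo 2 sum can be checked
numerically. The Python sketch below is an illustration under stated assumptions:
variables are tested in index order (by symmetry, every order is equivalent for parity),
and the helper names truth_table, tree_nodes, and diagram_nodes are introduced here,
not taken from the report.

```python
from itertools import product

def truth_table(f, n):
    """Truth table of f as a tuple; the first variable is the most significant."""
    return tuple(f(bits) for bits in product((0, 1), repeat=n))

def tree_nodes(table):
    """Internal nodes of the decision tree that tests variables in index order."""
    if len(set(table)) == 1:            # constant subfunction: a leaf
        return 0
    half = len(table) // 2
    low, high = table[:half], table[half:]
    if low == high:                     # current variable is redundant: skip it
        return tree_nodes(low)
    return 1 + tree_nodes(low) + tree_nodes(high)

def diagram_nodes(table):
    """Internal nodes when identical subfunctions are shared (same test order)."""
    shared = set()

    def visit(t):
        if len(set(t)) == 1 or t in shared:
            return
        half = len(t) // 2
        low, high = t[:half], t[half:]
        if low == high:                 # redundant variable: no node is created
            visit(low)
            return
        shared.add(t)
        visit(low)
        visit(high)

    visit(table)
    return len(shared)

def parity(bits):
    """Modulo 2 sum of the arguments."""
    return sum(bits) % 2

for n in (4, 8):
    table = truth_table(parity, n)
    print(n, tree_nodes(table), diagram_nodes(table))
# 4 15 7
# 8 255 15
```

For parity the tree has $2^n - 1$ internal nodes while the diagram has $2n - 1$, which
matches the orders of growth quoted above.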

Of the three measures of testing cost, only $h$ and $E$ are concerned with the performance
of a tree representation. The total testing cost, $\eta$, does not make use of the
probability distribution, nor does it measure a worst-case extreme. Although it is of
interest for binary identification problems as a measure of the cost incurred in producing
each output of the function exactly once, it does not generally correspond to practical
concerns that one has about a function. The worst-case testing cost, $h$, can be
efficiently minimized and is certainly relevant in practical problems; however, it lacks
discrimination power. Rivest [23] has shown that almost all (in the asymptotic sense)
Boolean functions are exhaustive, i.e., have maximal worst-case testing cost, a result
that we strengthen in the Appendix by proving that all symmetric and all linearly
separable Boolean functions are exhaustive. Thus the worst-case testing cost does not
discriminate between most Boolean functions and, within the sets of symmetric and
threshold functions, it does not discriminate between any functions.


Table 4. The counterexample for binary identification.

First test   Leaf profile   Height   Path length
T1 or T4     (0,1,1,1,2)      4         14
T2 or T3     (0,0,3,2,0)      3         12
T1 or T4     (0,1,0,4,0)      3         13

The preceding considerations indicate that the expected testing cost, $E$, is the more
generally useful measure of decision tree performance, while the diagram storage cost, $\beta$,
is a relevant measure of decision tree implementation costs. These measures are examined
in further detail in the following section.

6.1  The  expected  testing  cost  E 

Given an intrinsic function of $n$ variables, $f(x_1,\ldots,x_n)$, for which testing variable $x_i$
incurs cost $c_i$, the expected testing cost of any tree representation, $T$, of $f$ is bounded by

$$\min\{c_i \mid 1 \le i \le n\} \le E(T) \le \sum_{i=1}^{n} c_i.$$

These rather loose bounds can be tightened [16] to the following:

$$I(f) + \min\{l_f(x_i) \mid 1 \le i \le n\} \le E(T) \le \sum_{i=1}^{n} c_i - \max\{l_f(x_i) \mid 1 \le i \le n\},$$

where $I(f)$ is the intrinsic cost of the function and $l_f(x_i)$ is the loss of variable $x_i$ with
respect to the function [15].

In fact, Boolean functions with unity costs and uniform probability distributions
require an expected number of tests that converges to $n$; this can be shown as follows.
Let $F(n)$ be the number of Boolean functions of $n$ variables and let $J(n)$ be the number
of those that are intrinsic; then

$$F(n) = 2^{2^n} \quad\text{and}\quad J(n) = \sum_{i=0}^{n} (-1)^{n-i}\binom{n}{i}F(i).$$

But almost all Boolean functions are intrinsic, so that we have

$$\lim_{n\to\infty} J(n)/F(n) = 1$$

with rapid convergence. Now, the expected value of $E$ for a function of $n$ variables,
$E(n)$, must be at least as large as $E(n-1)$ for non-intrinsic functions and equal to
$1+E(n-1)$ otherwise; hence, we have the recurrence

$$E(n) \ge [(F(n)-J(n))\,E(n-1) + J(n)(1+E(n-1))]/F(n) = E(n-1) + J(n)/F(n).$$

Since $J(n)/F(n)$ rapidly converges to 1, the expected value of $E$ is essentially equal to $n$
for large values of $n$.
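
The speed of this convergence can be seen numerically. The Python sketch below is
illustrative only: it assumes that $J(n)$ is given by the standard inclusion-exclusion count
of functions depending on all $n$ variables, and it starts the recurrence from $E(0) = 0$;
the function name lower_bound is introduced here for the example.

```python
from fractions import Fraction
from math import comb

def F(n):
    """Number of Boolean functions of n variables."""
    return 2 ** (2 ** n)

def J(n):
    """Number of intrinsic functions of n variables (inclusion-exclusion)."""
    return sum((-1) ** (n - i) * comb(n, i) * F(i) for i in range(n + 1))

def lower_bound(n):
    """Iterate E(m) >= E(m-1) + J(m)/F(m) for m = 1..n, starting from E(0) = 0."""
    bound = Fraction(0)
    for m in range(1, n + 1):
        bound += Fraction(J(m), F(m))
    return bound

for n in range(1, 9):
    ratio = Fraction(J(n), F(n))
    print(n, float(ratio), float(lower_bound(n)))
```

For example, $J(3)/F(3) = 218/256$, and the ratio is already above 0.98 for $n = 4$, so
the bound trails $n$ by a small constant that is fixed by the first few terms.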


Table 5. The complexity of optimal binary identification.

Criterion    α_min      η_min      E_min      h_min      reverse profile   maximum profile
Complexity   NP-eqvlnt  NP-eqvlnt  NP-eqvlnt  NP-eqvlnt  NP-eqvlnt         NP-easy

This result, however, does not indicate that minimizing the expected testing cost is
useless, because the presence of non-uniform costs and probabilities results in the large
range  of  values  described  by  the  bounds  given  above.  Moreover,  the  expected  testing 
cost  can  be  efficiently  minimized,  as  indicated  in  Table  6. 

The  expected  testing  cost  is  the  most  frequently  used  criterion  in  the  literature.  In 
software  applications,  in  particular,  it  is  often  of  more  interest  to  optimize  the  running 
time of a routine than to minimize its memory requirements. More generally, one
expects  to  find  this  criterion  useful  whenever  a  premium  is  placed  on  performance  as 
opposed  to  acquisition  cost. 

6.2  The  diagram  storage  cost

The number of internal nodes of a binary decision diagram has been extensively studied
in [13], where diagrams are called programs. It is shown that $O(2^n/n)$ nodes are
sufficient to represent any Boolean function of $n$ variables, as compared with $O(2^n)$ for
trees. This result is easily extended to show that $O(k^n/n)$ nodes are sufficient to
represent any function of $n$ $k$-ary variables (versus $O(k^n)$ for trees) [18]. Thus, in
particular, a decision diagram is as succinct a representation of Boolean functions as is a
simplified and factored Boolean formula.

The  minimization  of  the  diagram  storage  cost,  however,  is  a  difficult  task.  It  cannot 
be done on a leaves-to-root scan because it requires that all subtrees be examined
simultaneously. This precludes the use of dynamic programming and necessitates some form
of  top-down,  backtracking  method.  Hence  it  must  be  suspected  that  the  problem,  which 
is  clearly  NP-easy,  is  also  NP-hard  (and  thus  NP-equivalent)  even  under  exponential- 
length  inputs.  Indeed,  the  only  existing  algorithm,  a  branch-and-bound  procedure  [22], 
may  exhibit  exponential  behavior  by  searching  through  almost  all  possible  diagrams  for 
a  function. 

The diagram storage cost has mostly been used in connection with hardware
implementation of decision trees, such as multiplexer networks for Boolean functions [4,13]. In
general,  one  expects  to  use  this  criterion  whenever  a  premium  is  placed  on  acquisition  or 
construction  costs  or  when  special  constraints  decrease  the  value  of  other  measures.  (An 
example  of  the  latter  is  a  synchronicity  constraint,  which  requires  all  evaluations  to  take 
the  same  time  and  thus  reduces  the  expected  testing  cost  to  the  worst-case  testing  cost, 
h. When h is known to be maximal, as is the case with most Boolean functions,
performance measures become altogether irrelevant.)


Table 6. Complexity of optimization criteria.

                     Input size as a function of the number of variables
Criterion            exponential       polynomial
α_min                low polynomial    NP-equivalent
β_min                ?                 NP-equivalent
η_min                ?                 NP-equivalent
H_min                ?                 NP-hard
h_min                low polynomial    NP-equivalent
E_min                low polynomial    NP-hard
min. rev. profile    low polynomial    NP-equivalent
maximum profile      low polynomial    NP-easy


7.  Summary 

Several  measures  used  for  the  assessment  of  decision  trees  have  been  reviewed.  It  has 
been  shown  that  they  are  pairwise  incompatible  in  all  but  a  few  cases.  This  disproves 
some  conjectures  regarding  the  simultaneous  optimization  of  those  measures.  Promising 
measures  have  been  individually  examined  and  two  new  results  proved  concerning  the 
behavior  of  one  measure  in  classes  of  Boolean  functions.  Based  on  the  results  presented, 
two  measures,  one  concerning  the  run-time  cost  and  the  other  the  retention  cost  of 
trees, appear to be the most generally applicable at this time. The complexity of
decision tree optimization under these two criteria was examined in detail.
Figure 1. A decision diagram (a) and its associated decision tree (b).

Figure 4. Known relationships between the eight measures applicable
to binary decision trees for partially specified Boolean functions.

Figure 5. Known relationships between the four measures applicable
to binary identification trees.

Symmetric and Threshold Boolean Functions Are Exhaustive

BERNARD M. E. MORET, MICHAEL G. THOMASON,
and RAFAEL C. GONZALEZ

Abstract: The worst-case number of variable evaluations (testing cost) of
Boolean functions is examined. Following up on a result by Rivest and Vuillemin,
we show that all symmetric as well as all linearly separable Boolean functions
are exhaustive, that is, have a pessimal worst-case testing cost.

Index Terms: Argument complexity, decision tree, multiplexer tree,
threshold function, worst-case testing cost.

I.  Introduction 

Rivest and Vuillemin [5] have shown that almost all (in the asymptotic
sense) Boolean functions are exhaustive, that is, require in
at least some cases that all of their variables be evaluated in order to
find the function's value. Thus it is natural to suspect that there exist
significant classes of Boolean functions in which every function is
exhaustive. In this correspondence, we identify two such classes:
symmetric functions and linearly separable (also known as threshold)
functions. This result has implications in logic design, fault analysis, pattern
recognition, and analysis of algorithms.

II.  Preliminaries 

Let $f(x_1,\ldots,x_n)$ be a Boolean function of $n$ variables (arguments).
A variable, $x_i$ of $f$, is redundant if the function is independent of the
value of that variable, i.e., $f|_{x_i=0} = f|_{x_i=1}$. A function without
redundant variables is said to be intrinsic. A binary decision tree is a
model of the sequential evaluation of a Boolean function, wherein the
value of a variable is determined and the next action (to choose
another variable to evaluate or to emit the value of the function) is
chosen accordingly. Decision trees have found numerous applications
in pattern recognition, taxonomy, logic design, decision table
programming, fault detection, and analysis of algorithms (see [4] for
further definitions and references).

From the definition, we see that the height of a decision tree
corresponds to the maximum number of variables that had to be
evaluated in order to determine the value of the function. The argument
complexity  of  a  function  is  then  defined  as  the  minimum  height  over 
all  decision  tree  representations  of  that  function.  Thus,  the  argument 
complexity  of  a  function  is  the  minimum  number  of  variables  that 
must,  in  the  worst  case,  be  examined  before  the  value  of  the  function 
can  be  determined.  A  function  is  said  to  be  exhaustive  if  its  argument 
complexity  is  maximal  (equal  to  the  total  number  of  variables).  Rivest 
and  Vuillemin  [5]  have  used  an  elegant  counting  method  to  show  that 
almost  all  Boolean  functions  are  exhaustive.  In  the  following  we  prove 
that  all  intrinsic1  symmetric  and  linearly  separable  Boolean  functions 
are  exhaustive,  using  the  specific  properties  of  those  classes. 

III.  The  Main  Results 

A Boolean function of $n$ variables, $f(x_1,\ldots,x_n)$, is said to be
symmetric if and only if (iff), for each permutation, $\sigma$, over $n$ letters,

$$f(x_{\sigma(1)},\ldots,x_{\sigma(n)}) = f(x_1,\ldots,x_n).$$

Equivalently, a function is symmetric iff there exists a set of $k$
numbers ($k \le n$), $\{a_1,\ldots,a_k\}$, where $0 \le a_1 < \cdots < a_k \le n$, such that
the function is equal to 1 exactly when $a_i$ of its variables are equal to
1, for any $i$, $1 \le i \le k$ [3]. Such a function has a single decision tree
structure; in particular, whenever $n - a_i$ variables have been found
equal to 0, the remaining $a_i$ variables must all be tested, since the
function will be equal to 1 if all are equal to 1. This proves the
following result.

Theorem 1: All (intrinsic) symmetric Boolean functions are
exhaustive. □

Now let $P$ be the defining property of a class of functions such that,
if $f$ possesses $P$, then both $f|_{x_i=0}$ and $f|_{x_i=1}$ possess $P$, for any choice
of $x_i$; in other words, $P$ is preserved by Shannon's decomposition. We
then have the following characterization of exhaustiveness.

Proposition:  All  intrinsic  functions  in  the  class  defined  by  P  are 
exhaustive  iff,  in  any  Shannon's  decomposition,  at  least  one  of  their 
two  subfunctions  is  intrinsic. 

Proof: The only if part follows immediately from the definition
of exhaustive: if $f$ is a function of $n$ variables and neither of its
restrictions with respect to some variable $x$ is intrinsic, then each of the
restrictions has a decision tree of height no greater than $n-2$, so that
$f$ has a decision tree of height $n-1$ rooted in $x$ and hence is not
exhaustive. For the if part, we use induction on the number of variables
of the functions. All intrinsic functions of one variable are trivially
exhaustive. Assume then that all intrinsic functions of $n-1$ variables
or less that answer the theorem's hypotheses are exhaustive. Consider
a function $f$ of $n$ variables that answers the theorem's hypotheses. Any
decision tree for $f$ starts by testing one of the $n$ variables. By
assumption, for any variable $x$ tested at the root, one of the restrictions
$f|_{x=0}$ and $f|_{x=1}$ is intrinsic. By inductive hypothesis, that restriction
is also exhaustive, since it is a function of $n-1$ variables that answers
the theorem's hypotheses. Thus all decision trees for that restriction
have height $n-1$; but then the decision tree for $f$ rooted in $x$ has
height $n$. Since this holds for any choice of $x$, $f$ is exhaustive. □

1 In the asymptotic sense again, almost all Boolean functions are
intrinsic.

We  first  consider  the  class  of  unate  functions— those  functions 
representable  by  a  Boolean  formula  where  no  variable  appears  in  both 
complemented  and  uncomplemented  form.  Since  decision  trees  are 
invariant  under  complementation  of  variables,  it  can  be  assumed 
without  loss  of  generality  that  all  variables  are  uncomplemented;  this 
defines  the  class  of  positive  unate  functions,  which  are  monotone 
increasing  [3].  Both  properties  are  easily  seen  to  be  preserved  by 
Shannon’s  decomposition. 

Let $f(x_1,\ldots,x_n)$ be an intrinsic positive unate function; then $f$ is
exhaustive iff, for each $x_i$, either $f|_{x_i=0}$ or $f|_{x_i=1}$ is intrinsic, that is,
there cannot be found $x_j$, $x_k$ ($j, k \ne i$) such that $f|_{x_i=0}$ does not
depend on $x_j$ and $f|_{x_i=1}$ does not depend on $x_k$. Without loss of
generality, let $i = 1$, $j = 2$, and $k = 3$, and let $\underline{x}$ stand for $(x_4,\ldots,x_n)$. Then
$f$ is not exhaustive iff

$$f(0,0,x_3,\underline{x}) = f(0,1,x_3,\underline{x}) \quad\text{and}\quad f(1,x_2,0,\underline{x}) = f(1,x_2,1,\underline{x}).$$

Since $f$ is monotone increasing, it must be the case that

$$f(1,x_2,1,\underline{x}) \ge f(0,0,x_3,\underline{x}),$$

so that, by topological sorting, the following relations are obtained:

$$f(1,1,1,\underline{x}) = f(1,1,0,\underline{x}) \ge$$
$$f(1,0,1,\underline{x}) = f(1,0,0,\underline{x}) \ge$$
$$f(0,1,1,\underline{x}) = f(0,0,1,\underline{x}) \ge$$
$$f(0,1,0,\underline{x}) = f(0,0,0,\underline{x}).$$

Let the four pairs of points above be denoted $a$, $b$, $c$, and $d$ in that
order. For any choice of $\underline{x}$, these four pairs can be mapped to the same
value or to two distinct values (0 and 1), with the following partitions.

i) $(abcd)$ mapped to the same value; then $x_1$, $x_2$, and $x_3$ are
redundant for that choice of $\underline{x}$.

ii) $(abc)$ mapped to 1 and $(d)$ to 0; then $x_2$ is redundant for that
choice of $\underline{x}$.

iii) $(ab)$ mapped to 1 and $(cd)$ mapped to 0; then $x_2$ and $x_3$ are
redundant for that choice of $\underline{x}$.

iv) $(a)$ mapped to 1 and $(bcd)$ to 0; then $x_3$ is redundant for that
choice of $\underline{x}$.

The monotone property excludes any other choice. This shows that
all unate functions of no more than three variables are exhaustive,
since then there is no choice for $\underline{x}$ and one of the four partitions above
must exist, contradicting the assumption of intrinsicalness. At the
same time, it shows how to construct a nonexhaustive unate function
of four variables; specifically, it is sufficient to find $\underline{x}' > \underline{x}''$ such that
$f(x_1,x_2,x_3,\underline{x}')$ is partitioned according to ii) and $f(x_1,x_2,x_3,\underline{x}'')$
according to iv), since then $x_2$ is redundant in one case and $x_3$ in the
other, but both are necessary overall. One such function is given by
the formula

$$f(x_1,x_2,x_3,x_4) = x_1x_2 + x_1x_4 + x_3x_4.$$

It is easily verified that this function has decision tree representations
of height 3.
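
The claims of this section can be checked by brute force for small $n$. The Python sketch
below is illustrative only: the function argument_complexity and the sample functions
are names introduced here, and the unate example assumes that the formula above reads
$x_1x_2 + x_1x_4 + x_3x_4$.

```python
from functools import lru_cache
from itertools import product

def argument_complexity(f, n):
    """Minimum height over all decision trees for f: {0,1}^n -> {0,1}."""
    points = list(product((0, 1), repeat=n))

    @lru_cache(maxsize=None)
    def height(fixed):
        # fixed[i] is 0 or 1 if x_(i+1) has been tested, None otherwise.
        values = {f(p) for p in points
                  if all(v is None or v == b for v, b in zip(fixed, p))}
        if len(values) == 1:            # function already determined: a leaf
            return 0
        best = n
        for i in range(n):
            if fixed[i] is None:        # try testing each untested variable
                subs = [height(fixed[:i] + (b,) + fixed[i + 1:]) for b in (0, 1)]
                best = min(best, 1 + max(subs))
        return best

    return height((None,) * n)

def unate4(x):
    """Assumed reading of the four-variable unate example."""
    return (x[0] & x[1]) | (x[0] & x[3]) | (x[2] & x[3])

def majority3(x):
    """Symmetric (and threshold) function of three variables."""
    return int(sum(x) >= 2)

def threshold3(x):
    """Threshold function with weights (2, 1, 1) and threshold 2."""
    return int(2 * x[0] + x[1] + x[2] >= 2)

print(argument_complexity(unate4, 4))      # 3: intrinsic but not exhaustive
print(argument_complexity(majority3, 3))   # 3: exhaustive
print(argument_complexity(threshold3, 3))  # 3: exhaustive
```

The search is exponential in $n$ and is meant only as a sanity check on examples of
this size.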

Thus,  not  all  unate  functions  are  exhaustive;  however,  a  subclass 
of  these  functions  does  possess  the  property. 

A Boolean function, $f(x_1,\ldots,x_n)$, is linearly separable (is a
threshold function) iff there exists a set of weights, $\{w_1,\ldots,w_n\}$, and
a threshold, $T$, such that the function evaluates to 1 exactly when

$$\sum_{i=1}^{n} w_i x_i \ge T.$$

Again, it is easily seen that linear separability is preserved under
Shannon decomposition. Since unate functions can be taken to be
positive, weights and threshold are assumed positive without loss of
generality. Let $\underline{w}$ stand for $(w_4,\ldots,w_n)$ and $\underline{x}^{t}$ for the transpose of
$\underline{x}$; substituting weights into the four pairs of function points yields

(a) $(w_1 + w_2 + \underline{w}\cdot\underline{x}^{t};\ w_1 + w_2 + w_3 + \underline{w}\cdot\underline{x}^{t})$,

(b) $(w_1 + \underline{w}\cdot\underline{x}^{t};\ w_1 + w_3 + \underline{w}\cdot\underline{x}^{t})$,

(c) $(w_3 + \underline{w}\cdot\underline{x}^{t};\ w_2 + w_3 + \underline{w}\cdot\underline{x}^{t})$,

(d) $(\underline{w}\cdot\underline{x}^{t};\ w_2 + \underline{w}\cdot\underline{x}^{t})$,

where any weight sum in (a) is no smaller than any weight sum in (b),
and so on. As seen above, a function will not be exhaustive iff $\underline{x}' > \underline{x}''$
can be found such that $f(x_1,x_2,x_3,\underline{x}')$ is partitioned as $(abc)(d)$
and $f(x_1,x_2,x_3,\underline{x}'')$ as $(a)(bcd)$. Thus, the smallest sums of weights
in any of (a), (b), and (c) must be larger than the largest sums of
weights in (d) when $\underline{x}'$ is chosen (since the first are above the threshold
while the second are below); similarly, the smallest sums of weights
in (a) must be larger than the largest sums of weights in any of (b),
(c), or (d) when $\underline{x}''$ is chosen. Using only the extremal sums, those
closest to the threshold value, this implies, for the partition $(a)(bcd)$,

$$w_1 + w_2 + \underline{w}\cdot\underline{x}''^{\,t} > w_1 + w_3 + \underline{w}\cdot\underline{x}''^{\,t},$$

and for the partition $(abc)(d)$,

$$w_3 + \underline{w}\cdot\underline{x}'^{\,t} > w_2 + \underline{w}\cdot\underline{x}'^{\,t}.$$

The first inequality yields $w_2 > w_3$ while the second implies $w_2 < w_3$,
a contradiction. Hence, $\underline{x}'$ and $\underline{x}''$ cannot be found, and we have the
following result.

Theorem 2: All (intrinsic) linearly separable Boolean functions
are exhaustive. □

IV.  Conclusion 

Since the argument complexity of a function determines the
worst-case performance of a sequential evaluation algorithm, our
results show that no optimization is possible for symmetric and
threshold Boolean functions. In particular, the total delay of a
multiplexer (or sequential lookup) implementation of such functions [2]
is fixed by the number of variables only. Similar remarks hold for
sequential evaluations of linear decision functions in pattern
recognition [6], longest paths through fault-trees (which usually describe
unate, and often linearly separable, functions) [1], and software
implementations of decision tables [4].
Boolean difference techniques for time-sequence and common-cause analysis
of fault-trees.*

ABSTRACT

Fault  trees  are  a  major  model  for  the  analysis  of  system  reliability.  In 
particular,  Boolean  difference  methods  applied  to  fault  trees  provide  a 
widely used measure of subsystem criticality. This paper considers the
generalization of the fault-tree model to time-varying systems and how
time-dependent Boolean differences can be used for the analysis of such systems.
In particular, suitable partial Boolean differences are shown to provide
maximal  and  minimal  solution  sets  for  sensitization  conditions.  A  method 
of  common-cause  failure  analysis  based  on  partial  time-dependent  Boolean 
differences  is  developed,  which  allows  the  study  of  failures  due  to  repeated 
occurrences,  at  different  times,  of  the  same  phenomenon.  Finally,  the 
application  of  those  methods  to  systems  with  repair  is  studied;  it  is  shown 
how,  under  certain  assumptions  of  independence,  steady-state  distributions 
can  be  used  for  the  analysis  of  system  faults. 
1.  Introduction


Fault tree analysis is a method of major importance in reliability and safety
studies [2,7,10,14]. A fault tree is a representation (using logic operations) of a Boolean
function, the structure function of the system, which describes the set of elementary
events (subsystem failures) necessary to make the system fail. When the structure
function is monotone non-decreasing, that is, when a system is such that a failure of an
additional  subsystem  cannot  improve  the  system’s  status,  the  structure  function  is 
called  s-coherent  [2].  In  this  article,  we  restrict  ourselves  to  such  functions. 

Of particular importance in system reliability studies is the determination of a
component's criticality, that is, of a component's influence on the behavior of the
system. Among several criticality measures [8], the most commonly used is Birnbaum's
importance measure, which is simply the probability that the system is in a state in
which the functioning of the component completely determines that of the system.
(That is, the system fails if the component fails, works if the component works.) This
measure can be obtained by using the Boolean difference operator [12], a powerful
analytical tool for combinational logic expressions, in particular when a large number of
variables is involved. Recall that the Boolean difference of a Boolean function, $f$, with
respect to one of its variables, $x$, is the function

$$\frac{df}{dx} = f|_{x=0} \oplus f|_{x=1},$$

where $\oplus$ stands for exclusive-or and $f|_{x=0}$ denotes the restriction of $f$ to that part of
its domain where $x$ takes the value 0.
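
As a concrete illustration, the Python sketch below computes a Boolean difference by
enumeration and uses it to evaluate Birnbaum's measure. Everything specific in it is an
assumption made for the example: the 2-out-of-3 structure function, the failure
probability 0.1, the independence of component failures, and the helper names
boolean_difference and birnbaum.

```python
from itertools import product

def boolean_difference(f, i):
    """Return g with g(x) = f(x with x_i = 0) XOR f(x with x_i = 1)."""
    def g(x):
        x0 = x[:i] + (0,) + x[i + 1:]
        x1 = x[:i] + (1,) + x[i + 1:]
        return f(x0) ^ f(x1)
    return g

def birnbaum(f, n, i, p):
    """Probability that component i is critical.

    p[j] is the failure probability of component j; failures are assumed
    independent, and x_j = 1 denotes the failure of component j.
    """
    g = boolean_difference(f, i)
    others = [j for j in range(n) if j != i]
    total = 0.0
    for bits in product((0, 1), repeat=n - 1):
        x = [0] * n                     # the value of x[i] does not affect g
        weight = 1.0
        for j, b in zip(others, bits):
            x[j] = b
            weight *= p[j] if b else 1 - p[j]
        total += weight * g(tuple(x))
    return total

def two_of_three(x):
    """Hypothetical system that fails when at least two components fail."""
    return int(sum(x) >= 2)

print(birnbaum(two_of_three, 3, 0, [0.1, 0.1, 0.1]))
```

In this hypothetical system a component is critical exactly when one of the other two has
failed, so the measure equals $2p(1-p)$, about 0.18 for $p = 0.1$.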

There is interest in fault tree analysis also in systems in which the configuration of
components required to cause failure changes at a finite number of discrete points
during the interval in which the system is in operation. The term phased mission [6] has
been used to describe such time-varying systems. In effect, variations with time force
one to consider sequential rather than combinational logic functions as the structure
functions for the system's description. In such systems, the importance of a component
can only be measured in each separate phase by conventional methods, since the
measure is itself time-dependent. However, it is important to evaluate the effects of
individual components over the system's operation time, as well as to attempt to identify
underlying causes (known as common causes [5,15]) for time-dependent failures.

In the following, we show how the concept of time-dependent Boolean differences
[9] can be used to develop methods for the analysis of sequential (rather than
combinational) structure functions. Methods for the determination of sensitization conditions,
path dependences, and measures of importance are illustrated. We show that partial
Boolean differences, taken with respect to suitable subfunctions, allow the determination
of maximum and minimum sets of conditions for sensitization and criticality
measurements. We then develop a new method for common cause analysis using partial Boolean
differences and illustrate it on a phased mission example. We conclude by showing how
the above methods can be applied to systems with repair, using steady-state
distributions under mild assumptions of independence.


2.  Time-dependent  Boolean  differences,  sequential  functions,  and  phased  missions 

Time-dependent  Boolean  differences  were  introduced  in  [9]  as  a  tool  for  the 
analysis  of  sequential  logic  functions.  A  time  superscript  is  used  on  switching  expressions 
(single  variables  or  more  complex  expressions)  to  denote  their  value  during  a  specific 
time  interval  relative  to  some  reference  starting  point.  In  effect,  the  superscripts  create 
distinct  variables  for  each  time  reference. 


To illustrate these concepts briefly in the original context of sequential digital
networks, we consider the circuit of Figure 1, composed of an AND gate, an OR gate, and
two D-type (delay) flip-flops. The value of the primary output, $Z$, at time $t$, $Z^t$, is given
by the sequential Boolean function:

$$Z^t = X_1^t + X_1^{t-2} X_2^{t-2}.$$

To determine the conditions which make $Z^t$ dependent upon $X_1^t$ (that is, such that $X_1^t$
is critical), we compute the Boolean difference of $Z^t$ with respect to $X_1^t$:

$$\frac{dZ^t}{dX_1^t} = \bar{X}_1^{t-2} + \bar{X}_2^{t-2}.$$

The solutions of $dZ^t/dX_1^t = 1$ give the necessary and sufficient conditions, both in
logical value and timing, for $X_1^t$ to be critical for $Z^t$, namely $X_1 = 0$ or $X_2 = 0$ at time $t-2$.
For dependence of $Z^t$ on $X_1^{t-2}$, we compute

$$\frac{dZ^t}{dX_1^{t-2}} = \bar{X}_1^t X_2^{t-2},$$

thus establishing the conditions $X_1 = 0$ at $t$ and $X_2 = 1$ at $t-2$. Note that $dZ^t/dX_1^{\tau} = 0$ for
$\tau \neq t$ and $\tau \neq t-2$, reflecting the fact that $X_1$ can only influence $Z^t$ at times $t-2$ and $t$.
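
The two differences above can be checked by brute force. The following is a minimal sketch, assuming only the sequential function $Z^t = X_1^t + X_1^{t-2} X_2^{t-2}$ given in the text; the variable names (`X1_t`, `X1_t2`, `X2_t2`) and helper functions are illustrative and not part of the original report. Each time-indexed variable is treated as a distinct Boolean variable, as the superscript notation suggests.

```python
from itertools import product

# Time-indexed variables are treated as distinct Boolean variables:
# X1 at time t, X1 at time t-2, X2 at time t-2 (names are illustrative).
VARS = ["X1_t", "X1_t2", "X2_t2"]

def Z(v):
    # Z^t = X1^t + X1^(t-2) * X2^(t-2)
    return v["X1_t"] | (v["X1_t2"] & v["X2_t2"])

def boolean_difference(f, var):
    """Set of assignments (tuples over VARS) at which df/dvar = 1."""
    solutions = set()
    for bits in product((0, 1), repeat=len(VARS)):
        v = dict(zip(VARS, bits))
        if f({**v, var: 1}) ^ f({**v, var: 0}):
            solutions.add(bits)
    return solutions

def minterms(g):
    """Set of assignments (tuples over VARS) at which g = 1."""
    return {bits for bits in product((0, 1), repeat=len(VARS))
            if g(dict(zip(VARS, bits)))}

d_now = boolean_difference(Z, "X1_t")     # dZ^t / dX1^t
d_past = boolean_difference(Z, "X1_t2")   # dZ^t / dX1^(t-2)
```

Comparing `d_now` with the minterms of "X1 = 0 or X2 = 0 at t-2", and `d_past` with those of "X1 = 0 at t and X2 = 1 at t-2", reproduces the conditions stated above.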

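Specific path dependencies can be isolated by partial Boolean differences. For
example, if the dependence of $Z$ on $X_1$ via the path $X_1 \to A \to B \to C \to Z$ is
desired, we compute the chain of derivatives:

$$\frac{dA^{\tau_1}}{dX_1^{\tau_0}} \cdot \frac{dB^{\tau_2}}{dA^{\tau_1}} \cdot \frac{dC^{\tau_3}}{dB^{\tau_2}} \cdot \frac{dZ^{\tau_4}}{dC^{\tau_3}},$$

wherein the linking of the time sequence requirements is reflected in the time
superscripts. Computing these derivatives at actual time intervals yields:

$$\frac{dA^t}{dX_1^t} = X_2^t, \qquad \frac{dB^{t+1}}{dA^t} = 1, \qquad \frac{dC^{t+2}}{dB^{t+1}} = 1, \qquad \frac{dZ^{t+2}}{dC^{t+2}} = \bar{X}_1^{t+2}.$$

Thus the chain yields:

$$\frac{dZ^{t+2}}{dX_1^t} = X_2^t \bar{X}_1^{t+2},$$

in accordance with our earlier computation of $dZ^t/dX_1^{t-2}$.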

As  noted  above,  the  concept  of  a  system’s  requirements  changing  at  a  finite 
number  of  points  over  an  interval  of  interest  essentially  converts  its  structure  function 
into  a  sequential  logic  function,  so  that  the  appropriate  analytical  method  becomes  the 
time-dependent Boolean difference. In each time interval, the structure function is
s-coherent. We illustrate these concepts by applying them to a simplified example of a
phased mission, due to Esary and Ziehms [6].

The  example  considers  the  interaction  of  a  fire  department,  which  operates  three 
vehicles, a large fire engine (M), a tanker (T), and a light truck (L), with a chemical
plant, the safety equipment of which consists of a sprinkler system (S), a hydrant (H),
and a special chemical fire extinguisher system (F). A fire at the plant can be
decomposed into three phases. In the initial stage, the large engine or the light truck combined
with  the  sprinkler  system  will  allow  time  for  evacuation.  In  the  second  stage,  the  special 
chemical  extinguisher  system  is  needed  to  contain  the  fire,  together  with  either  the  large 
engine or the light truck; the needed water can be supplied by the hydrant or, if
necessary, by the tanker through the large engine’s pumps. In the last phase, the fire is
brought under control by the special system or by the large engine; again, the needed
water  can  come  from  the  hydrant  or  the  tanker. 

Thus we have a six component, three phase system, described by the block
diagram of Figure 2. The whole system works iff each succeeding phase works in turn.
Thus the system’s success function is the product of the three phases’ success functions.
The three phases are described by the functions

$$P_1 = S^t L^t + M^t,$$

$$P_2 = F^t (T^t M^t + H^t (M^t + L^t)),$$

$$P_3 = F^t H^t + M^t (T^t + H^t).$$

Hence the system’s success function is, at time $t$:

$$Succ^t = P_1^{t-2} P_2^{t-1} P_3^t.$$

This  is  a  sequential  logic  function.  (Note  that  it  is  a  particularly  simple  type  of  phased 
mission,  since  the  several  phases  do  not  mix  or  interact.)  We  can  analyze  each  phase 
separately,  such  as  by  finding  under  which  conditions  the  availability  of  the  large  fire 
engine,  M,  is  critical  in  phase  1: 

$$\frac{dP_1}{dM} = 1 \iff \frac{d(SL + M)}{dM} = 1 \iff \bar{S} + \bar{L} = 1 \iff S \text{ or } L \text{ fails}.$$

However,  we  can  use  time  differences  to  find  under  which  conditions  the  availability  of 
the  same  component  in  phase  1  is  critical  to  the  success  of  the  mission: 

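$$\frac{dSucc^t}{dM^{t-2}} = 1 \iff (\bar{S}^{t-2} + \bar{L}^{t-2}) \cdot F^{t-1} (T^{t-1} M^{t-1} + H^{t-1} (M^{t-1} + L^{t-1})) \cdot (F^t H^t + M^t (T^t + H^t)) = 1,$$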

which  says,  of  course,  that  those  conditions  are  that  either  S  or  L  fails  in  phase  1  while 
phases  2  and  3  are  successful. 
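
As a check on this result, the difference can be evaluated exhaustively. The sketch below assumes the three phase functions as reconstructed from the verbal description of the mission (phase 1: $SL + M$; phase 2: $F(TM + H(M+L))$; phase 3: $FH + M(T+H)$); the variable names such as `M1` (component M during phase 1, that is, at time t-2) are illustrative labels rather than the report's notation.

```python
from itertools import product

# One Boolean variable per component and phase in which it appears.
VARS = ["S1", "L1", "M1",              # phase 1 (time t-2)
        "F2", "T2", "M2", "H2", "L2",  # phase 2 (time t-1)
        "F3", "H3", "M3", "T3"]        # phase 3 (time t)

def p1(v):
    return (v["S1"] & v["L1"]) | v["M1"]

def p2(v):
    return v["F2"] & ((v["T2"] & v["M2"]) | (v["H2"] & (v["M2"] | v["L2"])))

def p3(v):
    return (v["F3"] & v["H3"]) | (v["M3"] & (v["T3"] | v["H3"]))

def succ(v):
    # The mission succeeds iff every phase succeeds.
    return p1(v) & p2(v) & p3(v)

def difference(f, var, v):
    # Boolean difference of f with respect to var, evaluated at assignment v.
    return f({**v, var: 1}) ^ f({**v, var: 0})

def closed_form(v):
    # (not S or not L) in phase 1, with phases 2 and 3 successful.
    return ((1 - v["S1"]) | (1 - v["L1"])) & p2(v) & p3(v)

agree = all(
    difference(succ, "M1", dict(zip(VARS, bits))) == closed_form(dict(zip(VARS, bits)))
    for bits in product((0, 1), repeat=len(VARS))
)
```

Under these assumptions, `agree` is true over all 4096 assignments: the engine's availability in phase 1 is critical to the mission exactly when S or L fails in phase 1 and the later phases succeed.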

Yet of more interest is the criticality of the same component throughout the
mission, under the assumption that a failure at time $\tau$ implies that the component stays
faulty for $t > \tau$ (no repair):

$$\frac{d^3 Succ^t}{dM^{t-2}\, dM^{t-1}\, dM^t} = 1 \iff \bar{S}^{t-2} F^{t-1} \bar{L}^{t-1} (\bar{F}^t H^t + \bar{F}^t T^t + \bar{H}^t T^t) = 1.$$

This triple Boolean difference gives the conditions under which the functioning of
component M is critical in every phase of the mission; that is, under these conditions,
each  phase  of  the  system  reduces  to  component  M.  Other  conditions  of  interest  include 
those  under  which  the  status  of  a  component  in  some  phase  is  critical,  or  those  under 
which  specific  combinations  of  components  become  critical;  the  next  section  develops  a 
general  approach  to  the  derivation  of  such  conditions  with  Boolean  differences. 

3.  Some  properties  of  Boolean  differences 

When  analyzing  a  system,  it  is  often  desired  to  determine  its  sensitivity  to  various 
modes  of  failure  of  its  components.  While  standard  (multiple)  Boolean  differences  allow 
the  determination  of  a  system’s  sensitivity  to  a  particular  sequence  of  component 
failures,  they  cannot  provide  the  answer  to  such  questions  as:  “If  one  of  two  components 
fails,  under  which  conditions  will  the  system  certainly  fail?  possibly  fail?” 

In  order  to  answer  such  questions,  we  turn  to  Boolean  differences  with  respect  to 
subfunctions,  a  generalization  of  the  conventional  Boolean  differences  with  respect  to 
variables. The Boolean difference of the function $f(X_1, \ldots, X_n)$ with respect to the
subfunction $g(X_{i_1}, \ldots, X_{i_k})$, where $k \le n$ and $1 \le i_j \le n$, is defined as

$$\frac{df}{dg} = f|_{g=1} \oplus f|_{g=0},$$

which definition exactly parallels that of standard Boolean differences (indeed, a single
variable is a very simple subfunction). In the definition, we have

$$f|_{g=1} = \bigcup f|_{(X_{i_1}, \ldots, X_{i_k})},$$

where the union (inclusive-or) is taken over all k-tuples which are minterms of $g$ (i.e.,
which make $g$ take the value of 1). For instance, with $f(A,B,C)$, $g(A,B) = A + B$, and
$h(A,B) = A \cdot B$, we have

$$\frac{df}{dg} = f|_{A+B=1} \oplus f|_{A+B=0} = (f|_{A=B=1} + f|_{A=0,B=1} + f|_{A=1,B=0}) \oplus f|_{A=B=0},$$

$$\frac{df}{dh} = f|_{A \cdot B=1} \oplus f|_{A \cdot B=0} = f|_{A=B=1} \oplus (f|_{A=B=0} + f|_{A=0,B=1} + f|_{A=1,B=0}).$$
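
The definition translates directly into a small routine. The following is a minimal sketch, assuming the definition $df/dg = f|_{g=1} \oplus f|_{g=0}$ with each restriction taken as the inclusive-or over the assignments that give $g$ the stated value; the function names and the sample function `f` are hypothetical and chosen only for illustration.

```python
from itertools import product

def restrict_union(f, g, sub_vars, rest, value):
    """OR of f over all assignments to sub_vars for which g equals value."""
    out = 0
    for bits in product((0, 1), repeat=len(sub_vars)):
        s = dict(zip(sub_vars, bits))
        if g(s) == value:
            out |= f({**rest, **s})
    return out

def diff_wrt_subfunction(f, g, sub_vars, rest):
    """df/dg evaluated at the assignment `rest` of the remaining variables."""
    return (restrict_union(f, g, sub_vars, rest, 1)
            ^ restrict_union(f, g, sub_vars, rest, 0))

# Illustrative function: f(A, B, C) = (A + B) * C.
f = lambda v: (v["A"] | v["B"]) & v["C"]
g = lambda s: s["A"] | s["B"]   # g = A + B
h = lambda s: s["A"] & s["B"]   # h = A * B

df_dg = {c: diff_wrt_subfunction(f, g, ["A", "B"], {"C": c}) for c in (0, 1)}
df_dh = {c: diff_wrt_subfunction(f, h, ["A", "B"], {"C": c}) for c in (0, 1)}
```

For this particular `f`, the difference with respect to $A+B$ equals $C$, while the difference with respect to $A \cdot B$ vanishes, since `f` takes the value $C$ both when $A \cdot B = 1$ and on the union of the assignments with $A \cdot B = 0$.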


Because of the symmetric definition of Boolean differences and since $X \oplus Y = \bar{X} \oplus \bar{Y}$, we
clearly have

$$\frac{df}{dg} = \frac{df}{d\bar{g}} = \frac{d\bar{f}}{dg} = \frac{d\bar{f}}{d\bar{g}}.$$

In particular, let $S$ be the success function of a system, $F = \bar{S}$ its failure function, and
$S_1$, $F_1$ the success and failure functions of a subsystem; we then have

$$\frac{dS}{dS_1} = \frac{dS}{dF_1} = \frac{dF}{dF_1}.$$

It  follows  that  whatever  results  are  developed  for  success  functions  remain  unchanged 
for  failure  functions;  thus,  without  loss  of  generality,  we  shall  from  now  on  use  only  suc¬ 
cess  functions. 

Given a system, $S$, and two components, $X$ and $Y$, we wish to know how the
failure of one or both of the components will affect the system’s behavior. There are
three failure modes to consider:

i. both X and Y originally work and they fail simultaneously;

ii. at least one of X and Y originally works and both eventually fail;

iii. both X and Y originally work and at least one of them fails.

The first mode corresponds to a change from $(X,Y) = (1,1)$ to $(X,Y) = (0,0)$; the second
corresponds to a change in the value of the function $g = X + Y$; and the third
corresponds to a change in the value of the function $h = X \cdot Y$. This leads us to consider
the functions:

i. $S|_{X=Y=0} \oplus S|_{X=Y=1}$;

ii. $\dfrac{dS}{dg}$;

iii. $\dfrac{dS}{dh}$.

The  following  result  defines  the  relationship  (illustrated  in  Figure  3)  between  these  and 
other  Boolean  differences. 


Theorem 1: Let $S$ be the success function of a system, $X$, $Y$ the success functions of
two of its components. Then

$$\left\{\frac{dS}{d(XY)}\right\} = \left\{\frac{dS}{d(X \oplus Y)}\right\} \subseteq \left\{\frac{d^2 S}{dX\, dY}\right\} \subseteq \left\{\frac{dS}{d(X+Y)}\right\} = \{S|_{X=Y=0} \oplus S|_{X=Y=1}\},$$

where $\{f\}$ denotes the set of minterms of $f$. □


Notice that the conditions expressed by $\{dS/d(XY)\}$ are such that the failure of
either X or Y will precipitate that of S, while those expressed by $\{dS/d(X+Y)\}$ are such
that the failure of both X and Y may be necessary to cause that of S. This is
formalized in the following result.


Corollary 1: $\{dS/d(XY)\}$ and $\{dS/d(X+Y)\}$ are minimum, respectively maximum,
solution sets for the sensitization of $S$ to the subsystems $X$ and $Y$. □


The  inclusions  stated  in  the  theorem  are  in  general  proper;  however,  one  or  both 
may  degenerate  into  an  equality.  In  this  case,  we  have  the  following  results. 


Corollary 2:

$$\frac{dS}{d(XY)} = \frac{d^2 S}{dX\, dY} \iff \{S|_{X=Y=0}\} = \{S|_{X=1,Y=0}\} \cap \{S|_{X=0,Y=1}\},$$

that is, if the system works equally with only one subsystem or the other
functioning, it will work without either subsystem.

$$\frac{d^2 S}{dX\, dY} = \frac{dS}{d(X+Y)} \iff \{S|_{X=1,Y=0}\} = \{S|_{X=0,Y=1}\},$$

that is, the system’s success function is symmetric in X and Y.

$$\frac{dS}{d(XY)} = \frac{d^2 S}{dX\, dY} = \frac{dS}{d(X+Y)} \iff \{S|_{X=Y=0}\} = \{S|_{X=1,Y=0}\} = \{S|_{X=0,Y=1}\},$$

that is, the system only depends on whether or not X and Y both work (only the
value of the function $X \cdot Y$ need be known, rather than the individual status of X
and Y). □
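
The chain of Theorem 1 can be checked exhaustively on small systems. The sketch below is an illustration only: it takes s-coherent to mean monotone nondecreasing (an assumption about the intended definition), enumerates every monotone function of four variables (X, Y and two others), and tests the equalities and inclusions pointwise. All names are hypothetical.

```python
from itertools import product

N_REST = 2                                          # variables other than X and Y
POINTS = list(product((0, 1), repeat=2 + N_REST))   # points (X, Y, W1, W2)

def is_monotone(table):
    # Raising any single input from 0 to 1 must never lower the output.
    for p in POINTS:
        for i, b in enumerate(p):
            if b == 0:
                q = p[:i] + (1,) + p[i + 1:]
                if table[p] > table[q]:
                    return False
    return True

def check(table):
    # s01 denotes S restricted to X = 0, Y = 1, and so on.
    for w in product((0, 1), repeat=N_REST):
        s00, s01, s10, s11 = (table[(x, y) + w]
                              for x, y in ((0, 0), (0, 1), (1, 0), (1, 1)))
        d_and = s11 ^ (s00 | s01 | s10)       # dS/d(XY)
        d_xor = (s00 | s11) ^ (s01 | s10)     # dS/d(X xor Y)
        d_mixed = s00 ^ s01 ^ s10 ^ s11       # d2S/dXdY
        d_or = s00 ^ (s01 | s10 | s11)        # dS/d(X+Y)
        if not (d_and == d_xor and d_and <= d_mixed <= d_or
                and d_or == s00 ^ s11):
            return False
    return True

monotone_tables = []
for bits in product((0, 1), repeat=len(POINTS)):
    table = dict(zip(POINTS, bits))
    if is_monotone(table):
        monotone_tables.append(table)

theorem_holds = all(check(t) for t in monotone_tables)
```

Since a minterm set is contained in another exactly when the first function is pointwise no greater than the second, the comparison `d_and <= d_mixed <= d_or` expresses the two inclusions.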

As  an  example,  consider  again  the  phased  mission  example  presented  above.  If  we 
need  to  find  under  which  circumstances  an  initial  failure  of  either  the  light  truck  or  the 
tanker  will  precipitate  the  failure  of  the  mission,  we  need  to  consider  the  partial  Boolean 
difference 

$$\frac{dSucc^t}{d\left[(L^{t-2} L^{t-1} L^t)(T^{t-2} T^{t-1} T^t)\right]},$$

whereas,  if  we  are  concerned  with  the  influence  of  the  simultaneous  failure  of  both  sub¬ 
systems  in  the  last  phase,  then  we  must  consider  the  partial  Boolean  difference 

$$\frac{dSucc^t}{d(L^t + T^t)}.$$

4.  Common  cause  failure  analysis 

A  common  cause  may  be  defined  as  an  event  which  precipitates  the  failure  of  one 
or  more  components  of  a  system,  yet  is  not  explicitly  described  in  the  system  [5,14,15]. 
For  instance,  in  a  fault-tree  describing  how  an  integrated  circuit  could  fail,  primary 
events  may  include  cracked  die,  loose  bondings,  input  and  output  short-circuits,  all 
events which could have been caused by excessive mechanical or thermal stresses.
Thus  vibrations  and  temperature,  although  not  explicitly  mentioned  in  the  fault  tree, 
could  be  a  major  factor  in  that  circuit’s  reliability  analysis. 

Several  approaches  have  been  proposed  for  common-cause  analysis  (see  [14]  for  a 
brief survey), using probabilistic or logic methods, and trying to identify common causes
or to assess their consequences. We outline below a method based on partial Boolean
differences,  which  is  more  flexible  than  most  methods  proposed  to  date  and  lends  itself 
to  both  qualitative  and  quantitative  analysis. 

A  common  cause  may  have  more  complex  direct  consequences  than  the  simple 
failure  of  a  number  of  components;  in  particular,  the  failure  of  a  component  may  pro¬ 
tect  another  from  the  common  event’s  effects.  Thus,  common  cause  analysis  cannot 
proceed  in  a  general  manner  by  substituting  specific  component  failures  for  the  common 
event.  Rather,  the  common  cause  must  be  represented  as  a  (not  necessarily  s-coherent) 
Boolean  function,  call  it  C,  and  the  effects  of  {C}  must  be  investigated.  This  can  be 
easily done with partial Boolean differences. Specifically, the common cause event, $C$,
will precipitate the failure of the system, $S$, exactly when

$$\frac{dS}{dC} = 1.$$


Examining  the  cut  sets  for  dS/dC  allows  a  qualitative  analysis  of  the  common 
event’s  effects,  while  the  probability  of  its  being  fatal  to  the  system  is  directly  obtained 
by  computing  the  probability  of  the  set  {dS/dC}.  Furthermore,  the  use  of  time- 
dependent  Boolean  differences  allows  us  to  consider  the  time-sequence  effects  of  common 
causes. 


Thus  a  complete  common  cause  analysis  would  proceed  by  first  determining  com¬ 
mon  causes  of  interest  and  expressing  their  effects  on  the  system’s  components  as  a 
(time-dependent)  Boolean  function,  then  computing  the  Boolean  differences,  and  finally 
extracting  minterms,  cut  sets,  etc.,  as  needed  for  the  analysis.  It  is  noted  that  all  of  the 
Boolean  operations  involved  in  computing  Boolean  differences  are  elementary  and  can  be 
easily  carried  out  by  an  automatic  system  (such  as  the  SETS  program  [13]). 

Returning  to  our  phased-mission  example,  consider  the  influence  of  the  common 
cause  event  that  results  in  cutting  the  water  supply  at  the  site.  As  a  result,  both  the 
hydrant  and  the  sprinkler  system  will  fail,  so  that  the  common  event  can  be  written  as 

$$C = \bar{H} \bar{S}.$$


If  that  event  occurs  during  the  second  phase  of  the  mission,  then  the  continuing  success 
of  the  mission  will  depend  on  the  event  if  the  following  Boolean  difference  evaluates 
to  1: 


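$$\frac{d^2 Succ^t}{dC^{t-1}\, dC^t} = P_1^{t-2} \cdot F^{t-1} \left( M^{t-1} \bar{T}^{t-1} + L^{t-1} (\bar{T}^{t-1} + \bar{M}^{t-1}) \right) \left( F^t (\bar{T}^t + \bar{M}^t) + M^t \bar{T}^t \right).$$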
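
The two phase-dependent factors of this expression can be verified separately, since the phase-1 factor does not involve the variables at times t-1 and t and carries through unchanged. The sketch below assumes that the common cause is $C = \bar{H}\bar{S}$ (both hydrant and sprinkler fail) and that the phase functions are $P_2 = F(TM + H(M+L))$ and $P_3 = FH + M(T+H)$; the function names are illustrative.

```python
from itertools import product

def p2(F, T, M, H, L):
    return F & ((T & M) | (H & (M | L)))

def p3(F, T, M, H):
    return (F & H) | (M & (T | H))

def diff_wrt_cause(phase, others):
    """d(phase)/dC for C = (not H)(not S).

    S does not appear in phases 2 and 3, so C = 1 forces H = 0,
    while C = 0 leaves H free (union over H = 0 and H = 1)."""
    f_c1 = phase(H=0, **others)
    f_c0 = phase(H=0, **others) | phase(H=1, **others)
    return f_c1 ^ f_c0

def closed2(F, T, M, L):
    # F (M notT + L (notT + notM))
    return F & ((M & (1 - T)) | (L & ((1 - T) | (1 - M))))

def closed3(F, T, M):
    # F (notT + notM) + M notT
    return (F & ((1 - T) | (1 - M))) | (M & (1 - T))

ok2 = all(
    diff_wrt_cause(p2, dict(F=F, T=T, M=M, L=L)) == closed2(F, T, M, L)
    for F, T, M, L in product((0, 1), repeat=4)
)
ok3 = all(
    diff_wrt_cause(p3, dict(F=F, T=T, M=M)) == closed3(F, T, M)
    for F, T, M in product((0, 1), repeat=3)
)
```

Both checks succeed under these assumptions, which supports reading the barred literals in the expression as failures of the tanker and of the large engine.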


5.  Simple  steady-state  systems  with  repair 

Allowing  for  the  possibility  of  repair  of  faulty  subsystems  results  in  a  much  more 
complex  system.  A  steady-state  condition  can  be  established  within  each  phase,  how¬ 
ever,  and  analyzed  with  standard  modelling  techniques,  such  as  queueing  theory  (see, 
e.g.  [11]). 


A  common  assumption  in  such  analyses  is  that  of  time-independence;  both  the 
failure  and  the  repair  processes  are  treated  as  Poisson  processes,  so  that  the  behavior  of 
the system in a phase can be derived from the knowledge of just the failure and repair
rates. In a phased mission, the time analysis is complicated by the presence of phase
transitions, which may result in a phase being initiated with previous failures still
present or already repaired, depending on the interaction of the phase transitions and
the failure and repair processes. However, the assumption of time-independence allows
us  to  complete  the  analysis  on  a  phase-by-phase  basis,  which  also  allows  for  the  possibil¬ 
ity  of  phase-dependent  failure  and  repair  rates.  Moreover,  and  more  importantly,  the 
Boolean  difference  methods  developed  above  are  still  applicable.  (Since  the  success  func¬ 
tions  of  the  various  subsystems  are  unchanged,  the  difference  is  just  in  the  probability 
computations.)  Finally,  queueing  methods  permit  the  analysis  of  phased  missions  where 
the  change  of  phase  is  itself  a  random  process  (e.g.,  because  it  is  triggered  by  external 
events).  In  fact,  if  the  phase  transitions  are  themselves  time-independent  processes,  the 
analysis  can  be  done  by  superposing  two  finite-state  models,  with  resulting  states 
describing the functioning of all subsystems as well as the present phase.

Since  the  number  of  states  in  the  final  model  grows  exponentially  as  a  function  of 
the  number  of  system  components,  we  present  a  very  simple  example.  Our  system  has 
three phases and two distinct components, as shown in Figure 4. We let $\lambda_A$, $\lambda_B$ be the
failure rates of components A and B, respectively, and $\mu_A$, $\mu_B$ their repair rates; the
phase dependence of the rates is indicated by a superscript, according to our time
notation; finally, a rate is associated with each transition from phase $i$ to phase $j$.
The resulting model has $2^2 \cdot 3 = 12$ states, as shown in Figure 5, where each state is
labelled  by  three  digits,  denoting,  in  that  order,  the  functioning  of  component  A  (1  for 
working,  0  for  failing),  that  of  component  B,  and  the  phase  number.  (Note  that  transi¬ 
tions  represent  only  a  single  change  in  the  system,  since  simultaneous  changes  have  zero 
probability.) 

The steady-state equations (which we can write as a homogeneous linear system by
using  the  Markov  transition  rate  matrix)  describe  a  balanced  flow  in  and  out  of  each 
state;  together  with  the  binding  equation  stating  that  the  system  must  be  in  one  of  the 
12  states,  they  allow  us  to  solve  for  the  probability  of  occupation  of  any  state.  Since 
each  state  is  also  a  unique  point  in  the  space  of  the  time-dependent  structure  functions, 
we  can  find  the  probability  of  any  set  of  minterms  by  summing  the  probabilities  of 
occupation  of  the  corresponding  states  (again,  all  of  this  is  easily  expressed  in  matrix 
form). 
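
To make the matrix formulation concrete, the following sketch builds a 12-state rate matrix and solves the balance equations together with the normalization condition. It is an illustration under stated assumptions, not the model of Figures 4 and 5: the phases are assumed to cycle 1, 2, 3, 1 so that a steady state exists, all rates are hypothetical inputs, and exact arithmetic with `Fraction` is used only to keep the example free of rounding.

```python
from fractions import Fraction
from itertools import product

def steady_state(lam_a, mu_a, lam_b, mu_b, phase_rate):
    """Each argument maps phase -> rate (integers or Fractions).
    Phases are assumed to cycle 1 -> 2 -> 3 -> 1."""
    phases = (1, 2, 3)
    states = [(a, b, p) for p in phases for a, b in product((1, 0), repeat=2)]
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    q = [[Fraction(0)] * n for _ in range(n)]   # transition rate matrix

    def add(src, dst, rate):
        q[idx[src]][idx[dst]] += rate
        q[idx[src]][idx[src]] -= rate

    for a, b, p in states:
        add((a, b, p), (1 - a, b, p), lam_a[p] if a else mu_a[p])  # A fails / repaired
        add((a, b, p), (a, 1 - b, p), lam_b[p] if b else mu_b[p])  # B fails / repaired
        add((a, b, p), (a, b, p % 3 + 1), phase_rate[p])           # phase change

    # Balance: sum_i pi_i * q[i][j] = 0 for each j; the last equation is
    # replaced by the binding equation sum_i pi_i = 1.
    rows = [[q[i][j] for i in range(n)] + [Fraction(0)] for j in range(n)]
    rows[-1] = [Fraction(1)] * n + [Fraction(1)]

    # Gauss-Jordan elimination on the augmented matrix.
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pv = rows[col][col]
        rows[col] = [x / pv for x in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return {s: rows[idx[s]][n] for s in states}

same = lambda value: {1: value, 2: value, 3: value}
pi = steady_state(same(1), same(3), same(2), same(2), same(1))
```

The probability of any set of minterms is then the sum of the corresponding entries of `pi`; for example, `pi[(1, 1, 3)]` is the probability that both components work during phase 3.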


6. Conclusion


We  have  investigated  the  use  of  Boolean  difference  methods  for  time-sequence  and 
common  cause  analysis  of  coherent  systems,  as  represented  by  fault-trees.  In  particular, 
we  have  shown  how  specific  partial,  time-dependent,  Boolean  differences  can  be  used  for 
the  derivation  of  minimum  and  maximum  sensitization  conditions  and  for  the  analysis 
of  complex  common  causes.  We  have  also  shown  that  such  methods  generalize  without 
changes  to  systems  with  repairs,  as  long  as  events  are  assumed  to  be  time-independent. 
We  conclude  that  Boolean  difference  methods,  which  have  been  and  still  are  widely  used 
for  fault  detection,  have  considerable  potential  in  reliability  and  sensitivity  analysis 
applications. 

Appendix


Proof of Theorem 1: By definition, we have

$$\frac{dS}{d(XY)} = S|_{X=Y=1} \oplus (S|_{X=1,Y=0} + S|_{X=0,Y=1} + S|_{X=Y=0}).$$

Since $S$ is s-coherent and $(0,0) < (0,1)$, $S|_{X=Y=0}$ is absorbed in $S|_{X=0,Y=1}$ (since,
whenever the first term has value 1, the second must have value 1 to maintain
s-coherence); thus we get

$$\frac{dS}{d(XY)} = S|_{X=Y=1} \oplus (S|_{X=0,Y=1} + S|_{X=1,Y=0}).$$

Similarly,

$$\frac{dS}{d(X \oplus Y)} = (S|_{X=Y=0} + S|_{X=Y=1}) \oplus (S|_{X=0,Y=1} + S|_{X=1,Y=0}),$$

but $S|_{X=Y=0}$ gets absorbed in $S|_{X=Y=1}$, so that we get

$$\frac{dS}{d(XY)} = \frac{dS}{d(X \oplus Y)},$$

whence our first equality. Again, by definition,

$$\frac{dS}{d(X+Y)} = S|_{X=Y=0} \oplus (S|_{X=Y=1} + S|_{X=0,Y=1} + S|_{X=1,Y=0}),$$

but both $S|_{X=0,Y=1}$ and $S|_{X=1,Y=0}$ get absorbed in $S|_{X=Y=1}$, so that we get

$$\frac{dS}{d(X+Y)} = S|_{X=Y=0} \oplus S|_{X=Y=1},$$

whence our second equality.

Finally, we have

$$\frac{d^2 S}{dX\, dY} = \frac{d}{dY}\left(\frac{dS}{dX}\right) = (S|_{X=0} \oplus S|_{X=1})|_{Y=0} \oplus (S|_{X=0} \oplus S|_{X=1})|_{Y=1}$$

$$= S|_{X=Y=0} \oplus S|_{X=0,Y=1} \oplus S|_{X=1,Y=0} \oplus S|_{X=Y=1}.$$

Now we establish the inequalities simply by remarking that

$$\{f\} \subseteq \{g\} \subseteq \{h\} \implies \{g \oplus h\} \subseteq \{f \oplus h\} \text{ and } \{g \oplus f\} \subseteq \{h \oplus f\}.$$

Thus, since

$$\{S|_{X=0,Y=1} \oplus S|_{X=1,Y=0} \oplus S|_{X=Y=0}\} \subseteq \{S|_{X=0,Y=1} + S|_{X=1,Y=0} + S|_{X=Y=0}\} \subseteq \{S|_{X=Y=1}\}$$

(the latter because of s-coherence), we get our first inequality by composition with
$S|_{X=Y=1}$. Similarly, since

$$\{S|_{X=Y=0}\} \subseteq \{S|_{X=0,Y=1} \oplus S|_{X=1,Y=0} \oplus S|_{X=Y=1}\} \subseteq \{S|_{X=0,Y=1} + S|_{X=1,Y=0} + S|_{X=Y=1}\}$$

(the former because of s-coherence), we get our second inequality by composition with
$S|_{X=Y=0}$. □


Figure  2.  A  phased  firefighting  mission  (from  Esary  &  Ziehms  [7]). 

ABSTRACT

Minimizing the size or cost of a set of tests without losing any discrimination
power is a common problem in fault testing and diagnosis, pattern
recognition, and biological identification. This problem, known as the
minimum test set problem, is known to be NP-hard, so that determining
an optimal solution is not always computationally feasible. Accordingly,
researchers have proposed a number of heuristics for building approximate
solutions, without, however, providing an analysis of their performance.
In this paper, we take an in-depth look at the main heuristics
and at the optimal solution methods, both from a theoretical and an
experimental standpoint. We conjecture that the heuristics will yield
solutions that stay within a factor of two of the optimal cost and present
generic examples where this factor is reached by any greedy heuristic. We
then  present  the  results  of  extensive  experimentation  with  randomly  gen¬ 
erated  problems.  While  the  exponential  explosion  suggested  by  the 
problem’s  NP-hardness  is  apparent,  our  results  suggest  that  real  world 
testing  problems  of  large  sizes  can  be  solved  quickly  at  the  expense  of 
large  storage  requirements. 

1. Introduction




Identification  problems  arise  in  almost  all  fields  of  scientific  research.  We  are  concerned 
here  with  a  special  type  of  deterministic  identification,  where  an  unknown  (system  state, 
animal  species,  location  of  a  fault)  must  be  classified  in  one  of  a  given  set  of  categories, 
based  on  the  outcome  of  a  set  of  tests.  Each  category  is  characterized  by  a  vector  of 
test  outcomes,  and  an  unknown  object  is  classified  in  that  category  if  its  vector  of  test 
outcomes  matches  the  category’s  characteristic  vector.  The  collection  of  all  categories 
together  with  their  characteristic  vectors  is  known  as  a  diagnostic  table.  A  diagnostic 
table with m categories and n tests can be represented as an m × n matrix, where the
(i,j) entry is the result of test $T_j$ applied to the unknown object $O_i$. Such formulation
is  common  in  testing  and  fault  analysis  [2,4,11],  biology  [15,22,23]  and  pattern  recogni¬ 
tion  [5,13]. 

Given  a  diagnostic  table,  it  is  often  the  case  that  some  tests  are  redundant.  In 
such  a  case,  it  is  of  interest  to  find  the  smallest  suitable  subset  in  order  to  minimize  the 
cost  of  identification.  The  minimum  test  set  (also  known  as  the  test  of  minimum 
length)  is  the  smallest  subset  of  tests  which  discriminates  between  all  categories  dis¬ 
tinguished  by  the  full  set  of  tests.  Knowledge  of  the  minimal  test  set  can  reduce  costs 
in  applications  where  a  rapid  identification  is  needed,  that  is,  in  situations  where  all  the 
tests  will  be  applied  in  parallel.  Cost  reduction  will  also  occur  in  applications  where  the 
capital  costs  (procurement  of  the  test  equipment)  far  exceed  the  running  costs,  regard¬ 
less  of  whether  the  actual  testing  is  done  in  a  parallel  or  sequential  manner.  (This  is  the 
minimization  of  the  acquisition  cost  in  decision  trees  [12].)  Applications  of  the  second 
type  are  to  be  found  in  most  fields  of  human  endeavor,  including  some  that  do  not 
explicitly  include  testing:  servicing  equipment  under  poor  access  conditions  (military 
equipment  in  the  field,  oil  rigs  at  sea),  where  the  cost  of  delivering  service  personnel  and 
apparatus  must  be  minimized;  remote  sensing  missions,  where  the  cost  of  the  apparatus 
design for testability, where, for instance, the number of checkpoints added to a circuit
must be minimized subject to retaining a prescribed level of testability.


Unfortunately, the minimization problem is known to be NP-hard [6]. Accordingly,
researchers have developed a number of heuristics for building suboptimal test
collections by using variants of a greedy algorithm where, at each step, the locally
optimal test is added to the partial solution. However, no analysis of those methods is
offered in the literature.

In  this  paper,  we  take  an  in-depth  look  at  existing  heuristics  and  how  they  can  be 
applied  to  develop  optimal  solutions.  We  conjecture  that  existing  selection  heuristics 
will  not  exceed  the  optimal  by  more  than  a  factor  of  2  and  provide  generic  examples 
where  this  factor  is  asymptotically  reached  for  all  existing  heuristics.  We  then  present 
and  discuss  the  results  of  extensive  experiments  with  both  artificial  (randomly  gen¬ 
erated)  and  real-world  problems.  While  the  exponential  explosion  suggested  by  the 
problem’s  NP-hardness  is  quite  apparent  in  the  artificial  examples,  our  results  suggest 
that  real-world  problems  of  large  sizes  can  be  solved  in  reasonable  time. 


2.  An  Analysis  of  Proposed  Heuristics 

Almost  all  proposed  heuristics  belong  to  the  class  of  greedy  algorithms,  in  that  they 
perform  local,  step-by-step  optimization,  using  a  suitable  selection  criterion.  Very  few 
analytical  results  are  available  about  the  minimum  test  set  problem  in  general  and  the 
behavior of the proposed heuristics in particular. A number of Russian researchers
[8,9,21] have studied the expected size of the minimum test set for randomly constructed
tables; the analyses of the main two heuristics discussed in [12] in the context of
identification trees do not extend to the minimum test set problem. In the following, we
briefly define the four main heuristics proposed in the literature and offer a partial
analysis of their worst-case behavior.

2.1 Definitions


When  a  pair  of  categories  is  distinguished  by  only  one  test  (that  is,  the  categories’ 
characteristic  vectors  differ  in  exactly  one  component),  that  test  is  called  essential  and 
must  be  included  in  any  complete  set  of  tests.  Thus,  in  a  step-by-step  method,  pre¬ 
including  all  essential  tests  is  an  optimal  policy;  all  proposed  methods  [2,3,18,20]  make 
use  of  this  policy. 
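
The notion of an essential test is easy to state in code. The following is a minimal sketch, assuming the diagnostic table is given as a list of rows (one characteristic vector per category); the function name and the small sample table are hypothetical.

```python
from itertools import combinations

def essential_tests(table):
    """table[i][j] is the outcome of test j on category i.
    Returns the indices of tests that alone separate some pair of categories."""
    essential = set()
    for row_a, row_b in combinations(table, 2):
        differing = [j for j, (x, y) in enumerate(zip(row_a, row_b)) if x != y]
        if len(differing) == 1:
            essential.add(differing[0])
    return essential

# Illustrative table with four categories and four binary tests.
table = [
    (0, 0, 0, 1),
    (0, 1, 0, 1),
    (1, 0, 1, 0),
    (1, 1, 1, 1),
]
found = essential_tests(table)
```

In this sample table the first two categories differ only in the second test (index 1), so that test must appear in any complete test set.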

When  all  essential  tests  have  been  included,  one  can  either  attempt  to  extend  the 
notion  of  essentiality  or  resort  to  a  measure  of  a  test's  local  optimality.  The  first 
approach  has  been  used  by  researchers  in  microbiology  [16,18,20]:  since  a  test  is  essential 
when it is the only test to separate a pair of categories, a test is “nearly essential” if it
is one of only two (or a few) tests to separate a pair of categories. This extension
restricts the choice of the next test to one of those that separate that pair of categories
which is separated by the least number of tests: the least-separated pair. An
algorithm using this criterion will thus focus on category pairs; ties between tests and/or
between equally poorly separated pairs are broken by the use of a “second-level”
heuristic, one of the measures of local optimality described below. We shall call this the
least-separated pair criterion.

The  second  approach  attempts  to  measure  how  well  a  new  test  will  complement 
those  already  chosen;  in  such  an  approach,  all  as  yet  unincluded  tests  are  considered  for 
inclusion.  An  obvious  choice  is  to  count  how  many  as  yet  unseparated  pairs  the  new 
test  will  distinguish  and  choose  a  test  which  maximizes  this  count:  we  shall  call  this  the 
separation criterion. This heuristic has been extensively used in fault analysis [3,4]
and microbiology [16,20]. The contribution of a test can also be measured in terms of
entropy (or, equivalently, of information), in which case the initial state (a single
homogeneous group) corresponds to an entropy of 0 and the final state (m distinct
groups of one category each) to an entropy of $\log_2 m$; it can also be measured in terms
of permutations, where the initial state corresponds to a value of 1 (for there is only
one way to assign a label to the single initial set) and the final state corresponds to a
value of $m!$, the number of ways in which m distinct items can be labelled. The
information theory approach is found early in the literature and used extensively for both
the minimum test set and the related minimum identification tree problems [4,12,16].
Formally, the entropy of a collection $C$ of $k$ clusters, of sizes $s_1, \ldots, s_k$, comprising $m$
elements in all, is defined as

$$H(C) = \log_2 m - \frac{1}{m} \sum_{i=1}^{k} s_i \log_2 s_i.$$

Applying  a  test  to  a  collection  of  clusters  yields  a  new  collection,  with  larger  (or  equal) 
entropy;  the  difference  is  the  amount  of  information  contributed  by  that  test.  The  test 
which brings about the largest increase will be selected. We shall call this approach the
information criterion. The combinatorial approach, described in [17] and used in [14],
considers how many possible distinct partitions of the size used could exist; the
logarithm of this quantity is used as a measure, called repartment [17]. However, the
repartment of a partition of m objects is equal to m times the entropy of the partition
(within an additive factor of $\log_2 m$), so that the two approaches are essentially
equivalent.
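
The information criterion can be sketched as follows. This is an illustration under an assumption: the entropy is taken in the normalized form $\log_2 m - \frac{1}{m}\sum_i s_i \log_2 s_i$, which gives 0 for a single group and $\log_2 m$ for m singletons, as the text requires. The function names and the representation of clusters as lists of category indices are hypothetical.

```python
from math import log2

def partition_entropy(sizes):
    """H = log2(m) - (1/m) * sum(s_i * log2(s_i)), with m the total size."""
    m = sum(sizes)
    return log2(m) - sum(s * log2(s) for s in sizes) / m

def information_gain(clusters, test_column):
    """Entropy increase when each cluster is split by the outcomes in
    test_column, where test_column[c] is the outcome for category c."""
    refined = []
    for cluster in clusters:
        groups = {}
        for category in cluster:
            groups.setdefault(test_column[category], []).append(category)
        refined.extend(groups.values())
    before = partition_entropy([len(c) for c in clusters])
    after = partition_entropy([len(c) for c in refined])
    return after - before, refined

gain, refined = information_gain([[0, 1, 2, 3]], [0, 0, 1, 1])
```

A binary test that splits four categories evenly contributes one bit, so `gain` is 1.0 here; the test with the largest gain would be the one selected under this criterion.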

Thus,  we  have  four  main  heuristics:  least  separated  pairs  with  separation  used  to 
choose  among  the  candidate  tests  (at  the  “second  level”);  least  separated  pairs  with 
information  used  at  the  second  level;  separation  alone;  and  information  alone.  The  first 
two  restrict  the  choice  of  tests  before  applying  a  local  measure  of  optimality,  while  the 
last two apply such a measure to all remaining tests.

2.2 Analysis

Examples are easily constructed that show that no heuristic is uniformly better than the
other three. The four heuristics are rather similar: restricting the choice to those tests
separating the least separated pair does affect the order in which the tests are selected
(which, in itself, is of no consequence with regard to the final subset selected); it also
affects the composition of the final subset, since a different order of selection may modify
the local measures of optimality of the remaining tests. Typically, we found that these
indirect effects are minor. Moreover, the measures of local optimality are all convex
functions, the minima and maxima of which all occur at the same points.
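The relationship between the restricted and unrestricted heuristics can be sketched as
follows. This is a minimal illustration, not the original programs: it assumes a binary
table stored as a list of rows, uses the separation measure (number of still unseparated
pairs that a test separates) as the local criterion, and takes the "least separated pair" to
be an unseparated pair distinguished by the fewest candidate tests. All names are
hypothetical; the information measure could be substituted for `separation`.

```python
from itertools import combinations

def unseparated_pairs(table, chosen):
    """Pairs of objects (row indices) not yet distinguished by the chosen tests."""
    return [(a, b) for a, b in combinations(range(len(table)), 2)
            if all(table[a][t] == table[b][t] for t in chosen)]

def separation(table, pairs, test):
    """Number of currently unseparated pairs that `test` would separate."""
    return sum(table[a][test] != table[b][test] for a, b in pairs)

def greedy_test_set(table, restrict_to_least_separated_pair=False):
    n_tests = len(table[0])
    chosen = []
    pairs = unseparated_pairs(table, chosen)
    while pairs:
        candidates = [t for t in range(n_tests)
                      if t not in chosen and separation(table, pairs, t) > 0]
        if not candidates:
            break  # the remaining pairs cannot be separated by any test

        if restrict_to_least_separated_pair:
            def separators(pair):
                a, b = pair
                return [t for t in candidates if table[a][t] != table[b][t]]
            separable = [p for p in pairs if separators(p)]
            hardest = min(separable, key=lambda p: len(separators(p)))
            candidates = separators(hardest)   # first level: restrict the choice

        # Second level (or only level): local measure of optimality.
        best = max(candidates, key=lambda t: separation(table, pairs, t))
        chosen.append(best)
        pairs = unseparated_pairs(table, chosen)
    return chosen

# Hypothetical table: 4 objects (rows), 3 binary tests (columns).
TABLE = [
    [0, 0, 1],
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]
print(greedy_test_set(TABLE))
print(greedy_test_set(TABLE, restrict_to_least_separated_pair=True))
```

On this small table both variants end with the same subset of tests but select them in a
different order, which is the kind of indirect effect discussed above.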

It is easily shown that, in the worst case, none of these heuristics is guaranteed to yield
a solution whose cost is within an additive constant of the optimal. Indeed, staying within a
fixed constant of the optimal is itself an NP-hard problem (our proof uses a technique that
has become standard in the field: see [6, pp. 138-139]). We show that this problem is NP-hard
by showing that, if a heuristic existed that produced a solution that had at most k more
tests than the optimal, then we could use it to construct the optimal solution (and thus solve
an NP-hard problem). The idea is to scale our problem up, solve it with the heuristic, and
then scale it down, so that the error margin will shrink to zero by rounding. Specifically,
given a problem, we “multiply” it by k+1 by making k+1 copies of each object (regard each
copy as a coordinate in a (k+1)-tuple) and thus k+1 copies of each test (one copy for each
coordinate). Notice that the optimal solution for this problem has exactly k+1 times the
number of tests of the optimal solution for the original problem. The heuristic solution
will fall within k of the optimal for the larger problem; now pick as solution for the
original problem the smallest of the test sets obtained by retaining from the heuristic
solution only those tests that apply to the same coordinate. That set has no more than
1/(k+1) the number of tests of the heuristic solution; thus it has at most k/(k+1) more tests
than the optimal solution; but all quantities must be integers, so that we have in fact
obtained the optimal solution.
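The rounding step of this argument can be written out explicitly. The symbols $T^*$, $T_H$
and $S_c$ are introduced here for illustration only: $T^*$ is the size of the optimal test
set of the original problem, $T_H$ the size of the heuristic solution of the scaled problem,
and $S_c$ the subset of that solution applying to coordinate $c$.

```latex
\begin{align*}
T_H &\le (k+1)\,T^* + k
    && \text{(the heuristic is within $k$ of the scaled optimum)} \\
\min_{1 \le c \le k+1} |S_c| &\le \left\lfloor \frac{T_H}{k+1} \right\rfloor
    && \text{(the smallest of $k+1$ classes is at most the average)} \\
 &\le \left\lfloor T^* + \frac{k}{k+1} \right\rfloor = T^*
    && \text{(since $0 \le k/(k+1) < 1$ and $T^*$ is an integer)}
\end{align*}
```

Since, by the construction above, each $S_c$ is a test set for the original problem, none can
have fewer than $T^*$ tests, so the smallest one has exactly $T^*$ tests.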

Furthermore, another standard technique can be used to show that staying within an
arbitrarily small ratio of the optimal is also NP-hard (i.e., no fully polynomial-time
approximation scheme [6] exists for the problem unless P = NP): it is an immediate consequence
of the corollary on page 141 of [6] and of the fact that our problem is strongly NP-hard.
Thus the best that can be hoped for is an algorithm producing solutions whose size stays
within a fixed ratio of the optimal.

We conjecture that all four heuristics discussed above exhibit the same worst-case
behavior, yielding a test set that is, for binary tests, at most twice as large as the
optimal, which, in view of the preceding proof, is about as good as can be expected.
The  main  rationale  behind  our  conjecture  can  be  stated  as  follows.  For  the  ratio  to 
grow  large,  the  optimal  solution  must  remain  small,  thereby  requiring  tests  with  good 
discriminating  power;  however,  those  tests  that  the  heuristic  erroneously  selects  must  be 
even  better  locally.  Now,  the  discriminating  power  of  a  test  is  always  measured  on  the 
whole  partition,  so  that  a  test  cannot  be  good  locally  without  being  fairly  good  overall. 
Therefore,  the  worst-case  behavior  should  occur  when  the  tests  comprising  the  optimal 
solution  are  quite  good  for  the  most  part,  although  marginally  less  good  at  each  step 
than  those  selected  by  the  heuristics.  As  to  the  heuristics,  after  selecting  those  locally 
good  tests,  they  must  complete  their  solution  set  with  locally  poor  tests,  so  that  they 
yield  a  large  solution  set. 

Pushing this to the extreme, we present below a generic example where any of the proposed
step-by-step heuristics will select what appear to be “perfect”† tests (in the sense that
they produce successive partitions where all subsets are of equal size), only to be forced
to include tests which, although initially good, are poor at this point, each effecting only
a few discriminations. All apparently perfect tests are in fact redundant in the final
solution, because the heuristics had to complete their partial set with those tests which
alone would comprise the optimal solution. Since the example yields an asymptotic ratio of 2,
and in view of the above arguments, we conjecture that the worst-case performance ratio of
any of the proposed heuristics never exceeds 2.

† i.e., even-splitting and such that each successive test divides exactly in two each of
the clusters determined by the previous tests.

We first construct the example for heuristics based on local optimality, then show how to
modify it by dropping a few tests to make it applicable to least separated pairs heuristics
as well. The example has $2^n$ objects and $2^n + 2n - 1$ tests (where n > 3), in three
groups. The collection comprises all $2^n$ simple tests, $\{s_1, \ldots, s_{2^n}\}$, where
test $s_i$ asks “does this object belong to category i?”. Then there are n-1 “perfect” tests,
$\{p_1, \ldots, p_{n-1}\}$, which by themselves determine a partition of $2^{n-1}$ subsets of
2 objects each; they are most easily constructed by filling the table with, for each object,
the most significant (n-1) bits of its index in the table. Finally, there are n “almost
perfect” tests, $\{t_1, \ldots, t_n\}$, each of which splits the categories $2^{n-1}+1$ vs.
$2^{n-1}-1$; we construct them by filling the table, for each object, with its binary Gray
code modified only by repeating the first code (all 0s) and omitting the code consisting of
all 1s; this guarantees the appropriate split by tilting the balance of 1s and 0s for each of
those tests. Following is the diagnostic table for n = 4.

[Diagnostic table for n = 4]

Now, any heuristic based on local optimality will first select all p tests, since they
produce a perfectly balanced partition at each step. Indeed, any k-step heuristic (that is,
one which selects tests k at a time, for some fixed k, rather than one at a time) will also
select the p tests, whenever k divides n-1. After that, the heuristic will select all of the
t tests, which are always preferable to the simple tests, and complete the test set by
picking either $s_1$ or $s_2$, to yield a solution of 2n tests. The optimal test set,
however, comprises all t tests and either the $s_1$ or the $s_2$ simple test, for a total of
n+1 tests. Thus the ratio is 2n/(n+1) for any heuristic that uses step-by-step optimization
with a local measure of discriminating power.
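The construction can be reproduced programmatically. The sketch below rests on one reading of
the description, which is an assumption: objects are indexed from 0, the t columns hold the
reflected Gray code sequence with the all-0s code repeated at the start and the all-1s code
dropped, and columns are ordered s, then p, then t. The function names are hypothetical. The
code only builds the table and checks the structural properties stated in the text; it does
not run the heuristics.

```python
def gray(i):
    """Standard reflected binary Gray code of i."""
    return i ^ (i >> 1)

def worst_case_table(n):
    """Rows are the 2**n objects; columns are the s tests, then the p tests, then the t tests."""
    size = 2 ** n
    all_ones = size - 1
    # Assumed reading: Gray sequence with the all-0s code repeated and the all-1s code dropped.
    codes = [0] + [gray(i) for i in range(size) if gray(i) != all_ones]
    table = []
    for obj in range(size):
        s = [1 if obj == j else 0 for j in range(size)]           # s_j: "is this category j?"
        p = [(obj >> (n - 1 - j)) & 1 for j in range(n - 1)]      # top n-1 bits of the index
        t = [(codes[obj] >> (n - 1 - j)) & 1 for j in range(n)]   # modified Gray code bits
        table.append(s + p + t)
    return table

def classes(table, columns):
    """Number of distinct outcome patterns produced by the given test columns."""
    return len({tuple(row[c] for c in columns) for row in table})

n = 4
table = worst_case_table(n)
size = 2 ** n
p_cols = list(range(size, size + n - 1))
t_cols = list(range(size + n - 1, size + 2 * n - 1))

print(len(table[0]) == size + 2 * n - 1)                # 2^n + 2n - 1 tests in all
print(classes(table, p_cols) == size // 2)              # p tests alone: 2^(n-1) groups of 2
print([sum(row[c] for row in table) for c in t_cols])   # number of 1s in each t column
print(classes(table, t_cols + [0]) == size)             # all t tests plus s_1 separate everything
```

Under this reading, the n t tests together with one of the first two simple tests already
distinguish all objects, which matches the optimal solution of n+1 tests described above.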


To make this example work for the least separated pair heuristics with a second-level
selection criterion of the type described above, we need to modify it so that the tests
separating one of the least separated pairs always include the next “perfect” test, thereby
ensuring that test's selection. In the example as built, the least separated pairs are
separated by two tests. Therefore, for each p test, we shall remove selected s tests to give
least separated “status” to a pair separated by that p test. In doing that, we cannot remove
both $s_{2i-1}$ and $s_{2i}$, since then the (2i-1, 2i) pair would be separated only by the
remaining t test, making it essential; nor can we remove either $s_1$ or $s_2$, since
removing one makes the other essential. Hence we pick suitable pairs separated by four tests
(two s, one t, and one p), and remove the two s tests in order to guarantee the selection of
the p test.

Specifically, we remove the following two s tests for each p test. For $p_1$, we remove
tests $s_{2^{\ldots}+1}$ and $s_{2^{\ldots}+2^{\ldots}+1}$; for $p_{n-1}$, we remove tests
$s_{2^n-3}$ and $s_{2^n}$; finally, for $p_i$, $2 \le i \le n-2$, we remove tests
$s_{2^{\ldots}+1}$ and $s_{2^{\ldots}+2^{\ldots}+2}$ (the exponents marked $\ldots$ are
illegible in the source). It is easily verified that no two of those removed tests are of the
form 2i-1, 2i and that the pairs chosen are separated only by the appropriate p test. The end
result is a problem on which any of the proposed step-by-step heuristics, including those
that select more than one test at a time, will produce a set of 2n tests as opposed to the
optimal's n+1.†

A different approach yields another example for which the separation-based heuristics will
exhibit an asymptotic ratio of 2. The idea is to transform known worst-case examples for the
related set covering problem [1,6,10] into minimum test set problems. Recall that a set
covering problem is given by a family of sets, the goal being to find the smallest number of
sets in that family that cover the family, i.e., such that their union is equal to the union
of all sets in the family. Let m be the number of elements in that union. The transformation
creates a pair of categories for each of the m distinct set elements; each set in the family
gives rise to a test, which takes values of 1 for the first of the pair of categories for
each element in that set and values of 0 everywhere else. The problem is completed by adding
$\lceil \log_2 m \rceil$ tests to distinguish the m pairs of categories. After selecting all
these tests, the separation heuristics will have to select those tests that correspond to
the sets selected in the set covering problem by the standard greedy method. Johnson [7] has
shown that the latter selection can be arbitrarily far from optimal, by a factor of
$\log_2 m$; specifically, he provides a generic example with $m = 3 \cdot 2^k$ where the
optimal cover uses 3 sets and the greedy solution requires k+1.

This translates in our problem to an optimal solution of $\lceil \log_2 m \rceil + 3$ tests
and a greedy solution of $\lceil \log_2 m \rceil + k + 1$ tests. Since
$k = \log_2 m - \log_2 3$, our worst-case ratio is

\[ \frac{\lceil \log_2 m \rceil + \log_2 m - (\log_2 3 - 1)}{\lceil \log_2 m \rceil + 3} \]

which is always smaller than 2 and reaches 2 asymptotically. (In this example, the greedy
solution has no redundant tests, so that it cannot be improved through the application of a
backwards elimination procedure.) This example can also be transformed to make it work for
the information-based heuristics.

† One could apply a backwards elimination procedure on the result of the forward selection
procedure (see [5]) in order to eliminate some of the redundant tests. However, the
elimination of redundant tests is itself a minimum test set problem and thus not amenable to
optimal solution. Although a greedy-based backwards elimination procedure would indeed
improve the greedy solution in this example, cases remain where the ratio of 2 would still
be reached, as our next example shows.
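The transformation from set covering to minimum test set, and the ratio it yields, can be
sketched as follows. The representation is an assumption made for illustration: elements are
numbered 0 to m-1, the family is a list of Python sets, and the added index tests are the
binary digits of the element number. The small family used below is hypothetical and is not
Johnson's example.

```python
def set_cover_to_test_set(family, m):
    """Build a diagnostic table with 2*m categories (rows) from a family of sets.

    Columns: one membership test per set, then ceil(log2 m) index tests.
    """
    n_bits = (m - 1).bit_length()          # equals ceil(log2 m) for m >= 1
    table = []
    for element in range(m):
        index_bits = [(element >> b) & 1 for b in range(n_bits)]
        membership = [1 if element in s else 0 for s in family]
        table.append(membership + index_bits)             # first category of the pair
        table.append([0] * len(family) + index_bits)      # second category of the pair
    return table

def separates_all(table, columns):
    """True when the given test columns distinguish every pair of categories."""
    return len({tuple(row[c] for c in columns) for row in table}) == len(table)

def worst_case_ratio(k):
    """Greedy size over optimal size for m = 3 * 2**k, following the formula in the text."""
    m = 3 * 2 ** k
    bits = (m - 1).bit_length()
    return (bits + k + 1) / (bits + 3)

family = [{0, 1, 2}, {3, 4, 5}, {0, 3}, {1, 4}, {2, 5}]   # hypothetical family, m = 6
table = set_cover_to_test_set(family, 6)
index_cols = [5, 6, 7]
print(separates_all(table, [0, 1] + index_cols))   # sets 0 and 1 form a cover
print(separates_all(table, [2, 3] + index_cols))   # sets 2 and 3 leave elements uncovered
print([round(worst_case_ratio(k), 3) for k in (5, 50, 500)])
```

A subset of membership tests, taken together with the index tests, separates all categories
exactly when the corresponding sets cover every element, which is why a poor greedy cover
translates into a poor greedy test set.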

3.  Bounding  Methods 

Non-exhaustive search algorithms that find the optimal solution, such as branch-and-bound
or depth-first search, require both upper and lower bounds on the size of the optimal
solution. The bounds are used in pruning, i.e., in eliminating fruitless directions of
search (pruning occurs whenever the local lower bound reaches or exceeds the global upper
bound); they can also be used in guiding the selection. A global upper bound is trivially
provided by the size of the best solution found so far; our conjecture of the previous
section, if true, would imply that this bound is fairly tight. (Indeed, a proof of our
conjecture would also provide a lower bound: the optimal solution must be at least half as
large as the easily computed greedy solution.)

Any measure of a test’s discriminating power that gauges distances on the way from the
initial to the final partition can also be used to derive lower bounds. At any step, we
compute the distance from the partition determined by our partial solution to the final
partition, as well as the local contribution of each available test (i.e., each test neither
chosen nor eliminated so far). We then sort the available tests’ contributions in decreasing
order and, assuming no interaction between tests, find by repeated summing how many tests are
needed to complete the partial solution. Since the contribution of a test does not increase
as the partial solution is expanded, this gives us a safe lower bound.

The separation-derived function gives us a lower bound on the number of additional tests
needed to distinguish the remaining unseparated pairs, assuming that no two tests separate
the same pair (a very unlikely event). The information-derived function finds a lower bound
on the number of additional tests needed to increase the entropy to the final partition’s
value, assuming no cross-information† between tests (an unlikely assumption, but one that
can be closely approximated). We note that, whenever there is cross-information between two
tests, there will be some pairs that they both distinguish, but that the converse is false.
For instance, if m is a power of 2 and $\log_2 m$ of the tests perfectly complement each
other, then those tests have no cross-information, yet any two of them distinguish $m^2/8$
common pairs. Thus we expect the separation bound to be much looser than the information
bound.
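The repeated-summing bound, and the count of common pairs quoted above, can be checked with a
short sketch. The function names are hypothetical; `distance` stands for either the number of
unseparated pairs or the entropy still to be gained, and `contributions` for the matching
local measure of each available test. Returning `None` when the contributions cannot reach
the distance is a convention chosen here, not one stated in the text.

```python
def lower_bound(distance, contributions, eps=1e-9):
    """Fewest tests whose contributions, largest first, add up to `distance`.

    Assumes no interaction between tests; returns None if even all tests fall short.
    """
    total, count = 0.0, 0
    for c in sorted(contributions, reverse=True):
        if total >= distance - eps:
            break
        total += c
        count += 1
    return count if total >= distance - eps else None

def common_pairs(m, bit_a, bit_b):
    """Pairs of objects 0..m-1 whose indices differ in both given bit positions."""
    return sum(1 for i in range(m) for j in range(i + 1, m)
               if ((i ^ j) >> bit_a) & 1 and ((i ^ j) >> bit_b) & 1)

# Separation-derived use: 6 unseparated pairs, tests separating 4, 4 and 3 of them.
print(lower_bound(6, [4, 4, 3]))
# Information-derived use: 2 bits still to gain, tests contributing 1, 1 and 0.81 bits.
print(lower_bound(2.0, [1.0, 1.0, 0.81]))
# Perfectly complementary tests (the bits of the object index) for m = 16.
print(common_pairs(16, 0, 3), 16 * 16 // 8)
```

The last line illustrates why the separation bound is loose: two tests that share no
information still separate m^2/8 pairs in common, and the bound counts those pairs twice.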

A simple example will illustrate our point. Consider a problem with m (an even number)
categories, where all tests effect an even split, m/2 vs. m/2. The tightest possible lower
bound on the size of the optimal solution is $\lceil \log_2 m \rceil$, an achievable size.
Now, each test separates $(m/2) \cdot (m/2) = m^2/4$ pairs and brings an increase in entropy
of 1 bit, so that the separation-derived bound is

\[ \left\lceil \frac{m(m-1)/2}{m^2/4} \right\rceil = \left\lceil \frac{2(m-1)}{m} \right\rceil \]

while the information-derived bound is

\[ \left\lceil \frac{\log_2 m}{1} \right\rceil . \]

The information-derived bound is as tight as possible; the separation-derived bound is off
by an unbounded factor.

† There is cross-information between two tests if the amount of information which they
contribute together is less than the sum of the amounts which they contribute individually.
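Evaluating the two bounds of this example for a few even values of m makes the gap concrete.
The sketch below simply computes the two closed forms in integer arithmetic; the function
names are hypothetical.

```python
def separation_bound(m):
    """ceil((m(m-1)/2) / (m^2/4)) for even m, computed with integers."""
    total_pairs = m * (m - 1) // 2
    pairs_per_test = (m // 2) ** 2
    return -(-total_pairs // pairs_per_test)    # integer ceiling division

def information_bound(m):
    """ceil(log2 m): each even-splitting test contributes at most 1 bit."""
    return (m - 1).bit_length()

for m in (4, 16, 64, 256, 1024):
    print(m, separation_bound(m), information_bound(m))
```

The separation-derived bound stays at 2 however large m becomes, while the
information-derived bound grows with the logarithm of m.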

The  better  the  tests  are,  the  poorer  the  separation-derived  lower  bounds  become; 
in  fact,  the  case  illustrated  above  is  essentially  the  average  case  in  random  tables  with 
large  numbers  of  tests,  since  well-splitting  tests  are  far  more  likely  than  others.  Thus 
the  separation-derived  bounds  will  be  practically  useless  in  problems  that  have  a  large 
number  of  tests  relative  to  their  number  of  categories  (a  “wide”  diagnostic  table).  Even 
for “square” or “tall” diagnostic tables, those bounds will be effective only if the tests
are  rather  poor. 

In fact, the information-derived bound can also be arbitrarily smaller than the size of the
optimal solution, although the factor cannot grow as large as for the separation-derived
bound. Clearly, the worst possible behavior for the information-derived bound is to indicate
the need for a logarithmic number of tests (close to the theoretical minimum) while the
problem in fact requires a linear number of tests (close to the theoretical maximum). Such a
behavior can be observed in a square diagnostic table where the 1 entries are disposed so as
to make the table into a lower triangular matrix. Thus, for m object categories, the
information-derived bound can be off by at most a factor of $O(m/\log_2 m)$, while the
separation-derived bound can be off by a factor of O(m).

4.  Experimental  Results  and  Discussion 
4.1  Goals  and  methodology 

The goals of experimentation were three-fold: to verify our deductions about the selection
criteria and bounding functions; to determine how much work was expended on finding the
optimal solution (as opposed to verifying its optimality); and to obtain an estimate of the
size of the largest problems that could be solved in a reasonable amount of time by these
techniques.

Faced with a problem of subset search, the algorithm designer usually has a choice of four
techniques: dynamic programming, cutting plane techniques, branch-and-bound, and depth-first
search. The minimum test set problem has no apparent formulation in the framework of dynamic
programming. Integer programming, through the solution of its linear programming subproblem
and the use of cutting planes, has proved very effective with the related (also NP-hard)
problem of set covering [1]. Unfortunately, the linear programming formulation of the minimum
test set problem requires a number of equations that grows as a quadratic function of the
problem’s size (as opposed to a linear function for the set covering problem), thereby
producing an excessively large system. The last two methods are more attractive for our
purposes since they both perform an explicit search of the state space as guided by selection
criteria and bounding functions. An estimate of the amount of storage needed for the
intermediate solutions in a straightforward implementation of branch-and-bound techniques
shows that memory requirements are too large to allow the solution of problems of useful
size.

Therefore we chose the depth-first search technique (also known as single branch
enumeration). It has the advantage of requiring only the storage of a single path in the
search space (whereas branch-and-bound techniques may require an exponential number). Also,
since the first solution produced by the depth-first search algorithm is the greedy
solution, the chosen algorithm has the added advantage of allowing immediate comparisons
between selection criteria. Finally, the two methods based on the least-separated pair should
work very well in most cases, as the restriction on the choice of candidate tests should
drastically diminish the size of the search space. We wrote four PASCAL programs, each
implementing one of the four selection heuristics with its matching bounding function. The
global upper bound is provided by the size of the best solution found so far; our conjecture
implies that this bound is fairly tight. The lower bounds can be derived from the local
contributions of each remaining test and the distance to the solution: we assume that the
tests do not interact and find by repeated summing of the tests’ contributions (sorted in
decreasing order) how many tests are needed to complete the partial solution.
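The overall search scheme can be sketched in a few lines. This is not a transcription of the
PASCAL programs, whose details are not given here; it is a minimal Python illustration that
assumes the separation measure for both selection and bounding, a binary table stored as a
list of rows, and no pre-inclusion of essential tests. All names are hypothetical.

```python
from itertools import combinations

def minimum_test_set(table):
    n_obj, n_tests = len(table), len(table[0])
    # Only pairs that the full test set can separate need to be separated.
    target = [(a, b) for a, b in combinations(range(n_obj), 2) if table[a] != table[b]]
    best = list(range(n_tests))           # trivial upper bound: use every test

    def search(chosen, available, pairs):
        nonlocal best
        if not pairs:                     # every target pair is separated
            if len(chosen) < len(best):
                best = list(chosen)
            return
        gains = {t: sum(table[a][t] != table[b][t] for a, b in pairs) for t in available}
        # Lower bound: largest contributions first, assuming no pair is separated twice.
        needed, covered = 0, 0
        for g in sorted(gains.values(), reverse=True):
            if covered >= len(pairs):
                break
            covered += g
            needed += 1
        if covered < len(pairs) or len(chosen) + needed >= len(best):
            return                        # infeasible, or cannot beat the best so far
        t = max(available, key=lambda u: gains[u])    # separation criterion
        rest = [u for u in available if u != t]
        remaining = [(a, b) for a, b in pairs if table[a][t] == table[b][t]]
        search(chosen + [t], rest, remaining)         # include t: follows the greedy path
        search(chosen, rest, pairs)                   # backtrack: drop t from this subtree

    search([], list(range(n_tests)), target)
    return sorted(best)

# Hypothetical table: 4 objects (rows), 3 binary tests (columns).
TABLE = [
    [0, 0, 1],
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]
print(minimum_test_set(TABLE))
```

Because the include branch is explored first, the first complete solution reached is the
greedy one; only the current path is kept in memory, in line with the storage argument above.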

All four programs pre-include essential tests whenever such are to be found. It is noted
that, although only those tests that are essential with the initial partition must appear in
any solution, the backtracking process may give rise to “locally essential” tests, since the
process of removing a test from consideration in the subtree may make other tests essential
in that region of the state space. As a result, an efficient implementation requires a
pair-oriented data structure, keeping track of which available tests distinguish each
unseparated pair, so that the search for essential tests can proceed in nearly constant
time. This data structure can grow very large and its memory requirements turned out to be
the main limiting factor in real-world examples.
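A pair-oriented structure of this kind might look as follows. The class and method names are
hypothetical, and the sketch favors clarity over speed: it rescans all pairs to find
essential tests, whereas an implementation aiming at nearly constant time would presumably
maintain per-pair counters and an undo log for backtracking.

```python
from itertools import combinations

class PairIndex:
    """For each unseparated pair, the set of available tests that distinguish it."""

    def __init__(self, table):
        n_tests = len(table[0])
        self.separators = {}
        for a, b in combinations(range(len(table)), 2):
            tests = {t for t in range(n_tests) if table[a][t] != table[b][t]}
            if tests:                     # pairs that no test can separate are ignored
                self.separators[(a, b)] = tests

    def essential_tests(self):
        """Tests that are the sole remaining separator of some unseparated pair."""
        return {next(iter(ts)) for ts in self.separators.values() if len(ts) == 1}

    def select(self, test):
        """Add `test` to the partial solution: the pairs it separates are resolved."""
        self.separators = {p: ts for p, ts in self.separators.items() if test not in ts}

    def discard(self, test):
        """Remove `test` from consideration; other tests may become locally essential."""
        for ts in self.separators.values():
            ts.discard(test)

    def dead_end(self):
        """True when some unseparated pair has no available separator left."""
        return any(not ts for ts in self.separators.values())

# Hypothetical table: 3 objects (rows), 3 binary tests (columns).
index = PairIndex([[0, 0, 0], [1, 1, 0], [0, 1, 1]])
print(index.essential_tests())    # no test is essential initially
index.discard(1)
print(index.essential_tests())    # discarding test 1 makes tests 0 and 2 locally essential
```

The memory cost is visible in the structure itself: one set of tests per pair of objects,
that is, a number of sets quadratic in the number of categories.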

All four programs were run on randomly generated problems and those two programs based on
the separation measure were also run on real-world examples excerpted from the microbiology
literature (in which test sets are regularly published). Nearly all real-world examples
included variable outcomes (i.e., undefined test values) and a few had multiple-valued (as
opposed to binary) tests. All artificial examples had binary tests only and the random
generator was set so that the two possible outcomes would be exactly balanced over the whole
table. (Such problems are harder than those where one outcome is favored, because the even
balance introduces a bias in favor of well-splitting tests. We also ran a number of
experiments with various skews; all proved noticeably easier to solve than their evenly
balanced counterparts.) When the number of tests was small for the number of objects, the
randomly generated table often did not distinguish between all objects; in such a case, a
solution is a subset of tests that effects as much discrimination as the full set of tests.
The data collected included the size of the greedy solution and of the optimal solution, the
initial lower bounds, the number of backtracks needed to reach the solution, and the total
number of backtracks used.
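One way to generate such artificial problems is sketched below. How the original generator
enforced the balance is not stated, so the approach here (shuffling an equal number of 0s and
1s over the whole table) is an assumption, as are the function names. The second function
encodes the definition of a solution used when the table does not distinguish all objects.

```python
import random

def balanced_random_table(n_objects, n_tests, seed=None):
    """Binary table whose 0 and 1 entries are exactly balanced over the whole table."""
    cells = n_objects * n_tests
    if cells % 2:
        raise ValueError("an exact balance needs an even number of entries")
    rng = random.Random(seed)
    values = [0] * (cells // 2) + [1] * (cells // 2)
    rng.shuffle(values)
    return [values[r * n_tests:(r + 1) * n_tests] for r in range(n_objects)]

def is_solution(table, tests):
    """True when `tests` discriminates as much as the full set of tests."""
    def classes(columns):
        return len({tuple(row[c] for c in columns) for row in table})
    return classes(tests) == classes(range(len(table[0])))

table = balanced_random_table(16, 8, seed=1)
print(sum(map(sum, table)), is_solution(table, list(range(8))))
```

A skewed variant would only change the proportion of 0s and 1s placed in `values`.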

4.2  Artificial  examples 

In  order  to  study  the  influence  of  the  number  of  categories  and  that  of  the  number  of 
tests  on  the  behavior  of  the  algorithms,  four  series  of  experiments  were  run.  In  two  of 
the  series,  the  number  of  categories  was  kept  constant  while  the  number  of  tests  was 
varied;  in  one  series,  the  process  was  reversed;  and  in  the  last  series,  all  problems  were 
square  with  increasing  sizes.  The  sizes  varied  from  6  to  64,  with  varying  resolution. 
Twenty-five examples were generated for each size in each series and their results
averaged; in all, nearly 2500 examples were run.

The results are presented in graphical form in Figures 1-6. Each of the first four figures
displays the data collected in one series of experiments in the form of four graphs: the top
two show the average total number of backtracks required (the most accurate measure of work)
as well as the average size of the solution, while the bottom two show the percentage of
work that was devoted to finding the optimal solution (as opposed to verifying it). The two
graphs on the left display the data obtained with the least separated pair heuristics and
those on the right are concerned with the other two heuristics. In all graphs, data points
marked with a triangle correspond to results obtained with the information-derived bounding
and selection functions; those marked with a square correspond to results obtained with the
separation-derived bounding and selection functions; and those marked with a circle indicate
the average size of the solution. Figure 5 illustrates the performance of the greedy methods
(it displays the average and largest values of the ratio of the size of the greedy solution
to that of the optimal solution) while Figure 6 illustrates the performance of the bounding
methods (it shows the average and largest values of the ratio of the initial lower bound to
the size of the optimal solution). Finally, curves were passed through the points in order
to make the graphs more legible. (Those curves should not be taken as an accurate depiction
of the heuristics’ behavior: the data are intrinsically discrete.)

In the first series, all problems had 16 categories, while the number of tests varied from
8 to 64 in steps of 2. Since 16 is a power of 2, it is a transition point for the size of the
best possible test set: although problems of that size could admit a solution of 4 tests,
such a solution would be hard to find, since it must be composed of 4 perfect tests with no
cross-information. For such a solution to exist, a rather large choice of tests must be
provided; therefore we expect the average size of a solution to stay above 5 until the number
of tests is large, then to decrease slowly. Since a solution of 5 tests is optimal in most
cases, we expect that a good bounding function will stop the search algorithm shortly after
the solution is found. Moreover, until a solution of 4 tests is feasible, there will be a
number of optimal solutions of size 5, so that one such solution will be found almost
immediately by all heuristics. Figure 1 shows the experimental results, which confirm our
expectations. While the information-based bounding did very well, the separation-based one
did very poorly, because many of the tests are well-splitting. Since the information bounds
were tight, the reduced branching factor associated with the least separated pair heuristics
did not play a significant role, while that role is clearly exemplified by the two programs
using the separation bounds.

In the second series, all problems had 22 categories; other parameters were as in the first
series. With 22 categories, the best possible test set has size 5. Such a solution is not as
difficult to realize as in the first series, since it is well above $\log_2 22 = 4.46$. On
the other hand, the bounding can be decisive only if a solution of 5 tests is reached; the
test interactions that make a 6-test set optimal will not be reflected strongly enough in the
bounds. As a result, we expect the programs to perform a nearly exhaustive search of the
first few levels of the search tree when the optimal solution has 6 tests. Of course, such
solutions abound, so that all heuristics will find one almost instantly, while solutions of
5 tests will be considerably more difficult to find until the choice of tests becomes
sufficiently large. As that choice grows, all four heuristics will find a solution of 5 tests
very early; the information-bounded programs will then stop shortly, while the
separation-bounded ones will go on and explore nearly the full tree. The role of the reduced
branching factor of the least separated pair heuristics will be as in the first series, minor
for the information-bounded programs and major for the other two. Experimental results are
shown in Figure 2.

In  the  third  series,  the  number  of  tests  was  kept  at  16,  while  the  number  of 
categories  was  varied  from  8  to  64  in  steps  of  2.  As  the  number  of  categories  increases, 
so does the size of the theoretically minimal solution; the size of the best realizable
solution increases even faster, since the choice of tests becomes relatively small. With a
large  number  of  objects,  the  tests  will  interact  significantly,  so  that  we  expect  bounds  to 
be  rather  loose  and  play  only  a  minor  role.  On  the  other  hand,  the  probability  that  a 
pair  is  separated  by  only  one  or  two  tests  significantly  increases,  so  that  we  expect  the 
least  separated  pair  heuristics  to  perform  considerably  better  than  the  other  two. 
Finally,  the  selection  heuristics  will  do  rather  well  because  the  tests,  with  so  many 
entries  in  each  column,  are  well  differentiated.  For  small  to  medium  numbers  of 
categories,  the  situation  is  more  complex.  The  selection  heuristics  will  perform  rather 
poorly  for  a  number  of  categories  just  above  the  number  of  tests,  because  the  optimal 
solution  is  likely  to  be  nearly  unique.  At  the  same  time,  the  number  of  categories  is  not 
large  enough  that  good  bounding  can  take  place;  thus  the  overall  work  should  increase 
dramatically.  However,  the  bounding  and  selection  will  improve  as  the  number  of 
categories and of optimal solutions increases, so that the work done will level out. As the
number of categories further increases, it becomes more difficult again to find the optimal
solution and we expect the total amount of work to increase once more. The results are shown
in Figure 3. Notice how closely the curves for the information-bounded programs follow those
for the separation-bounded ones, demonstrating the relative lack of success of the bounding
functions.

The  fourth  series  had  square  problems,  with  a  size  increasing  from  6  to  60.  (Only 
the  best  of  the  four  heuristics  was  used  for  the  larger  sizes;  indeed,  40x40  appeared  to  be 
the practical limit for the separation-bounded heuristics.) Since, in a square problem,
the  choice  of  tests  is  relatively  restricted,  we  can  expect  a  behavior  similar  to  that  exhi¬ 
bited  in  the  first  half  of  Figure  3.  Least  separated  pair  heuristics  will  hold  a  slight  edge 
over  the  other  two  and  the  information-guided  heuristics  will  vary  in  performance 
between much better than the separation-guided (when the solutions become harder
to find and thus provide for better bounding) and almost as poor (when the solutions
become easier to find and thus cause much looser bounds). Experimental results
are  shown  in  Figure  4.  They  dramatically  illustrate  the  trade-off  between  tight  bound¬ 
ing  and  ease  in  finding  solutions:  the  harder  a  solution  is  to  find,  the  easier  it  will  be  to 
prune  the  remaining  branches,  yet,  if  the  solution  is  too  hard  to  find,  most  of  the  tree 
will  be  explored  just  looking  for  it. 

The  data  collected  about  the  size  of  the  greedy  solutions  confirmed  that  all  four 
heuristics  are  good  selection  criteria.  No  greedy  solution  ever  exceeded  the  optimal  by 
more than 50%; on the average, greedy solutions were only 6% to 7% larger than
optimal.  As  expected  from  our  discussion  of  the  separation  and  information  measures, 
the  two  performed  equally  well  (the  average  ratios  were  always  within  one  percent  of 
each  other,  which  is  not  a  statistically  significant  difference  over  25  experiments);  the 
two  methods  relying  on  the  least-separated  pair  heuristic  showed  a  slight  advantage, 
presumably due to the narrower focus they impart on selection. Figure 5 presents the
results (the average ratios of the size of the greedy solution to the size of the optimal
solution)  in  the  form  of  four  graphs  (one  for  each  series  of  experiments);  in  each  graph, 
data  points  marked  with  a  cross  (X)  correspond  to  the  least-separated  pair  heuristics 
while  those  marked  with  a  plus  (+)  correspond  to  the  first-level  heuristics. 
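
As an illustration of what a greedy selection heuristic of this kind does, the following is a minimal sketch, not the programs used in the experiments. It assumes that the separation measure of a test is simply the number of still-unseparated pairs of categories that the test distinguishes; the function name `greedy_test_set` and the row-per-category table layout are hypothetical.

```python
from itertools import combinations

def greedy_test_set(table):
    """Greedy test selection by separation (illustrative sketch).

    table[c][t] is the outcome of test t on category c (assumed layout).
    At each step the test that separates the most still-unseparated
    pairs of categories is added to the solution.
    """
    n_cat, n_tests = len(table), len(table[0])
    unseparated = set(combinations(range(n_cat), 2))
    chosen = []
    while unseparated:
        # Test with the largest separation over the remaining pairs.
        best = max(
            range(n_tests),
            key=lambda t: sum(table[a][t] != table[b][t] for a, b in unseparated),
        )
        gained = {(a, b) for a, b in unseparated if table[a][best] != table[b][best]}
        if not gained:
            raise ValueError("some categories cannot be distinguished by any test")
        chosen.append(best)
        unseparated -= gained
    return chosen

# Four categories, three binary tests (made-up data).
example = [[0, 0, 1], [0, 1, 1], [1, 0, 0], [1, 1, 0]]
solution = greedy_test_set(example)
```

The result is a test set, not necessarily a minimum one; the backtracking programs discussed here use such a greedy solution only as a starting point.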

We  chose  to  illustrate  the  behavior  of  the  bounding  methods  by  collecting  statis¬ 
tics  on  the  ratio  of  the  size  of  the  optimal  solution  to  the  initial  lower  bound  (as  derived 
by  using  separation  or  information  measures).  The  average  and  worst-case  values  of 
this  ratio  are  plotted  in  Figure  6  in  four  graphs  (one  for  each  series  of  experiments);  in 
each  graph,  data  points  marked  with  a  triangle  correspond  to  information-based  bounds, 
while  those  marked  with  a  square  correspond  to  separation-based  bounds.  As  expected 
from our discussion, the lower bounds based on information are consistently better than
those based on separation. In particular, while the separation-derived bounds worsen
with increasing number of objects, no such trend is apparent for the information-derived
bounds.

Overall,  the  experimental  results  confirmed  our  evaluation  of  the  selection  criteria 
and the bounding functions. All four selection criteria appear equal. Least separated
heuristics  are  vastly  superior  to  the  other  two  when  efficient  bounding  is  not  possible  (as 
when  the  number  of  categories  grows  large  with  respect  to  the  number  of  tests),  due  to 
their  small  branching  factor.  Information-bounded  heuristics  are  much  better  than  the 
other  two  when  the  optimal  solution  is  found  early  and  efficient  bounding  can  be  done 
(as  when  the  number  of  tests  grows  large  with  respect  to  the  number  of  categories).  In 
all  cases,  the  most  efficient  program  used  the  least  separated  pair  heuristic  with 
information-based  bounding.  The  largest  solvable  problems  had  sizes  of  around  40  by 
40,  although  a  single  parameter  could  be  increased  well  beyond  that. 


4J  Real  world  examples 


In view of the results obtained with artificial examples, a certain optimism is justified as
regards  real  world  problems.  Such  problems  tend  to  have  many  essential  tests;  more¬ 
over,  they  are  often  composed  of  a  small  number  of  well-splitting  tests  and  a  large 
number  of  rather  poor  (possibly  simple)  tests.  With  such  a  structure,  selection  criteria 
should  perform  well,  as  several  microbiology  researchers  have  found  [16,10].  Moreover, 
many pairs will be separated by only a few tests, so that the least separated pair
heuristics  should  keep  the  branching  factor  quite  low. 
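
The two notions used in this paragraph can be made concrete with a short sketch. It assumes that a test is essential when it is the only test separating some pair of categories, and that the least separated pair is the not-yet-separated pair with the fewest separating tests; the function names and the table layout are hypothetical.

```python
from itertools import combinations

def separating_tests(table, a, b):
    """Tests whose outcomes differ on categories a and b."""
    return [t for t in range(len(table[0])) if table[a][t] != table[b][t]]

def essential_tests(table):
    """Tests that are the sole separator of at least one pair."""
    essential = set()
    for a, b in combinations(range(len(table)), 2):
        seps = separating_tests(table, a, b)
        if len(seps) == 1:
            essential.add(seps[0])
    return sorted(essential)

def least_separated_pair(table, chosen=()):
    """Among pairs not separated by the chosen tests, return the pair
    with the fewest separating tests, together with those tests."""
    best = None
    for a, b in combinations(range(len(table)), 2):
        seps = separating_tests(table, a, b)
        if any(t in chosen for t in seps):
            continue  # already separated by a chosen test
        if best is None or len(seps) < len(best[1]):
            best = ((a, b), seps)
    return best

# Made-up data: four categories, three binary tests.
example = [[0, 0, 1], [0, 1, 1], [1, 0, 0], [1, 1, 0]]
```

Branching only on the tests that separate the least separated pair is what keeps the branching factor low: every solution must contain at least one of them.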

We  used  the  separation  criterion  only,  as  the  information  criterion  is  not  easily 
adapted  to  problems  with  variable  test  outcomes.  (Being  based  on  clusters,  it  requires 
that  the  size  of  all  clusters  established  by  the  inclusion  of  tests  be  recorded.  In  turn, 
this  requires  that  all  clusters  be  kept  track  of  explicitly,  since  common  sets  and  subsets 
must  be  eliminated.  All  of  this  adds  up  to  excessive  bookkeeping  and  enormous  storage 
requirements.)  The  table  below  presents  a  synopsis  of  the  results  on  eight  examples 
from  the  microbiology  literature;  in  that  table,  LSP  stands  for  the  least  separated  pair 
heuristic  with  separation  bounding  while  SEP  stands  for  separation  as  the  “first-level” 
heuristic.  The  data  presented  include  the  size  of  the  problem,  the  size  of  the  optimal 
solution  and  that  of  the  greedy  solutions  found  by  each  heuristic,  the  number  of  essen¬ 
tial  tests,  the  total  number  of  backtracks  used  by  each  heuristic  in  obtaining  the 
optimal  solution,  and  the  percentage  of  work  used  to  discover  (as  opposed  to  verify)  the 
optimal  solution.  Several  remarks  are  in  order.  First,  many  of  the  tests  incorporated  in 
an  optimal  solution  were  essential,  showing  how  important  it  is  for  an  algorithm  to 
include  essential  tests  whenever  possible.  Secondly,  the  optimal  solution  was  almost 
always  found  immediately,  confirming  the  power  of  the  greedy  heuristics  in  real  world 
examples.  Third,  some  of  the  problems  run  were  four  times  larger  than  the  largest 
artificial  examples  attempted,  yet  ran  almost  a  hundred  times  faster.  Finally,  the 


                      Problem Size        Solution        Number      Backtracks    % Work to Sol.
Isolates            Categories  Tests   Opt.  LSP  SEP   Ess. Tests    LSP   SEP      LSP    SEP

Actinomadura^1          11       32       5     5    7        1          7    20      0.0   67.2
Cyanobacteria^2        106       19      16    16   16       15         16    16      0.0    0.0
Enterobacteria^3         7       20       7     7    7        4          7     7      0.0    0.0
Pseudomonas [14]        27       21       8     9    8        2         16    21     21.7    0.0
SMA12 kit^3            142       12      12    12   12       12         12    10      0.0    0.0
Streptococci^3          36       32      25    25   25       25         25    25      0.0    0.0
Streptococci [18]       50      122      36    36   36       31         43    43      0.0    0.0
Yeasts^4                98       56      16    16   17        5        144  1556      0.0    0.2

1  Goodfellow, M., et al. Numerical taxonomy of Actinomadura and related Actinomycetes. J. Gen. Microbiol.
112 (1979), pp. 95-111.

2  Rippka, R., et al. Generic assignments, strain histories and properties of pure cultures of cyanobacteria.
J. Gen. Microbiol. 111 (1979), pp. 1-61.

3  Rypka, E.W. Private communication. Lovelace Medical Center, Albuquerque, 1981.

4  Belin, J.M. Identification of yeasts and yeast-like fungi I: taxonomy and characteristics of new species
described since 1973. Can. J. Microbiol. 27 (1981), pp. 1235-1251.

program  using  the  least  separated  pair  heuristic  with  separation  bounding  never  used 
more than a minute of CPU time running on a VAX 11/780 computer. (For comparison,
the greedy heuristic used in [14] on the Pseudomonas example took 2.8 minutes on an
IBM360/50,  while  our  program  took  1.9  seconds  to  guarantee  an  optimal  solution  for  the 
same  problem  —  an  enormous  difference  that  we  attribute  mostly  to  our  careful  choice 
of  data  structures.) 

Overall  our  results  confirmed  our  optimism  about  real-world  problems  and  justify 
an  even  more  positive  attitude:  with  a  judicious  trade-off  between  time  and  space,  it  will 
be possible to solve even larger examples without major modifications. If some better
bounding method can be developed (and preliminary research indicates that this is
within reach, using linear programming with merged constraints), then the time traded
off will easily be regained. In fact, this indicates that a hybrid algorithm, partaking of
both  depth-first  search  and  branch-and-bound  techniques,  may  be  best. 

5.  Conclusion

We  have  reviewed  the  methods  proposed  in  the  literature  for  dealing  with  the  minimum 
test  set  problem.  We  have  evaluated  the  proposed  selection  heuristics  and  conjectured 
that  their  worst-case  behavior  never  produces  solutions  larger  than  twice  the  optimal, 
providing  two  examples  where  this  ratio  is  asymptotically  reached.  We  have  presented 
the  results  of  extensive  experimentation  with  four  backtracking  algorithms.  Our  results 
confirm  that  existing  selection  heuristics  are  quite  satisfactory;  they  also  indicate  that 
the  best  backtracking  method  involves  a  heuristic  which  uses  the  information  criterion 
for  selecting  tests  and  deriving  bounds  and  relies  on  the  least  separated  pair  heuristic  to 
keep  branching  factors  low.  Experimentation  with  real  world  problems  showed  the 
importance of pre-inclusion of essential tests; it also gave grounds for optimism since,
despite  the  known  NP-hardness  of  the  general  problem,  an  inferior  version  of  our  pro¬ 
grams  solved  large  problems  in  a  very  short  time. 

Much  work  remains  to  be  done.  Better  bounding  methods  must  be  sought,  which 
apply  even  when  variable  outcomes  are  present.  An  extension  of  the  information  cri¬ 
terion  is  the  obvious  first  step.  Beyond  that,  the  use  of  the  linear  programming  sub¬ 
problem  for  deriving  bounds  (as  used  for  the  set  covering  problem  in  [10])  appears 
promising;  although  the  size  of  the  linear  programming  problem  grows  faster  than  that 
of  the  original  problem,  one  can  diminish  it  by  merging  some  conditions,  thereby  relax¬ 
ing  the  constraints  (indeed,  merging  all  conditions  into  one  gives  us  the  separation  cri¬ 
terion).  In  addition,  the  integer  programming  approach  with  cutting  plane  methods  is 
worth  investigating,  despite  the  size  of  its  formulation.  Most  importantly,  ways  of 
incorporating  measured  amounts  of  redundancy  into  the  solution  must  be  sought:  the 
minimum  test  set  is,  by  definition,  a  fragile  entity.  While  redundancy  can  easily  be 
incorporated  through  simple  methods  (such  as  insisting  that  each  pair  be  separated  by 
at  least  two  tests,  whenever  possible),  the  more  complete  risk  model  of  pattern 
recognition [5] provides a suitable framework for more sophisticated methods.
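
The simple form of redundancy mentioned above (each pair separated by at least two tests, whenever possible) can be checked as in the following sketch. The function name `redundantly_separated` and the parameter `r` are hypothetical, and "whenever possible" is interpreted here as "up to the number of tests that separate the pair at all".

```python
from itertools import combinations

def redundantly_separated(table, chosen, r=2):
    """True if every pair of categories is separated by at least r of the
    chosen tests, or by all tests able to separate it when fewer than r
    such tests exist (illustrative reading of "whenever possible")."""
    n_tests = len(table[0])
    for a, b in combinations(range(len(table)), 2):
        available = sum(table[a][t] != table[b][t] for t in range(n_tests))
        used = sum(table[a][t] != table[b][t] for t in chosen)
        if used < min(r, available):
            return False
    return True

# Made-up data: four categories, three binary tests.
example = [[0, 0, 1], [0, 1, 1], [1, 0, 0], [1, 1, 0]]
```

On this small table the test set {0, 1} separates every pair but is not redundant in the above sense, while the full set {0, 1, 2} is.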

Optimal Solution of Linear Inequalities with
Applications to Pattern Recognition

D. C. CLARK and R. C. GONZALEZ, Senior Member, IEEE


Abstract-An algorithm for the optimal solution of consistent and
inconsistent linear inequalities is presented, where the optimality
criterion is the maximization of the number of constraints satisfied. In the
terminology of pattern recognition, the algorithm finds a linear decision
function which minimizes the number of patterns misclassified. The
algorithm  is  developed  as  a  nonenumerative  search  procedure  based  on 
several  new  results  established  in  this  paper.  Bounds  on  the  search  are 
also  developed  and  the  method  is  experimentally  evaluated  and  shown 
to  be  computationally  superior  to  other  techniques  for  finding  mini¬ 
mum-error  solutions. 

I.  Introduction

FORMAL APPROACHES to pattern recognition may be
divided into two principal categories: syntactic and
decision-theoretic [1], [2]. The syntactic approach is based on
the  use  of  automata  and  language  theory  to  process  patterns 
that  have  been  expressed  in  terms  of  structural  primitives.  The 
decision-theoretic  approach,  on  the  other  hand,  deals  with 
techniques  for  obtaining  decision  functions  capable  of  parti¬ 
tioning  sets  of  pattern  vectors  whose  components  are  real,  nu¬ 
merical  measurements  or  features. 

Central  to  the  decision-theoretic  approach  are  methods  for 
finding  decision  functions  that  are  optimal  in  some  sense.  In 
statistical  formulations,  the  approaches  due  to  Fisher  [3]  and 
Bayes [1] are the most commonly used in pattern recognition.
Fisher’s  classic  paper  establishes  a  procedure  for  finding  a  lin¬ 
ear  discriminant  function  with  the  maximum  ratio  of  interclass 
to  intraclass  scatter.  Bayes’  decision  rule  yields  the  minimum 
expected  loss  and,  in  the  Gaussian  case,  reduces  to  a  quadratic 
function  determined  by  the  mean  vectors  and  covariance  ma¬ 
trices  of  the  classes  under  consideration. 

A  criterion  of  optimality  that  has  been  receiving  increased 
attention  in  recent  years  is  based  on  finding  decision  functions 
which  minimize  the  number  of  errors  between  two  pattern 
classes. Unlike the formulations mentioned above, the
approaches that have been proposed in this area are procedures
which  employ  the  training  patterns  directly  to  find  minimum- 
error decision functions. The most noteworthy efforts in this
area are the algorithms proposed by Ibaraki and Muroga [4],
Warmack and Gonzalez [5], Miyake and Shinmura [6], [7],
and Miyake [8]. Other schemes which “tend” to minimize
the  number  of  errors  have  been  proposed  by  Mengert  [9]  and 
Smith  [10]  (see  also  the  comment  by  Grinold  [11]).  Finally, 
we  mention  the  stochastic  optimization  techniques  by  Wassel 
[17]  and  by  Do-Tu  and  Installe  [18]  for  minimizing  the  error 
rate. 

A  two-class  linear  decision  problem  may  be  expressed  as  a 
system  of  linear  inequalities  [1],  [5].  In  this  paper,  we  de¬ 
velop  an  algorithm  for  finding  an  optimal  solution  of  consis¬ 
tent  (corresponding  to  separable  pattern  classes)  and  inconsis¬ 
tent  (corresponding  to  inseparable  classes)  linear  inequalities, 
where  the  optimality  criterion  is  the  maximization  of  the  num¬ 
ber  of  constraints  satisfied  by  the  solution.  This  criterion  is  di¬ 
rectly  analogous  to  minimizing  the  number  of  misclassified 
patterns.  The  algorithm  is  developed  as  a  nonenumerative 
search  procedure  based  on  several  new  results  established  in 
this  paper.  Bounds  on  the  search  are  also  developed  and  the 
procedure  is  experimentally  evaluated  and  shown  to  be  com¬ 
putationally  superior  to  other  published  techniques  for  finding 
minimum-error  solutions. 

II.  Foundation 

Consider  the  system  of  homogeneous  linear  inequalities 

    $A w \ge 0$    (2.1)

where $A$ is an $m \times (n+1)$ matrix with $m > (n+1)$, and $w$
is an $(n+1)$-vector in $R^{n+1}$. It will be assumed throughout
the following discussions that $A$ satisfies the Haar condition
[12]; that is, every $(n+1) \times (n+1)$ submatrix of $A$ is of rank
$(n+1)$.

In the terminology of pattern recognition, $w$ is a weight vector,
each row of $A$ corresponds to an augmented pattern vector
so that $a_{i,n+1} = \pm 1$, and (2.1) is the statement of a two-class,
$m$-pattern problem in which the augmented patterns of
one class have been multiplied by $-1$. The Haar condition
implies that the patterns are in general position [1].
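
The construction of $A$ from two sets of training patterns can be sketched as follows. This is only an illustration of the augmentation described above; the function name `augmented_matrix` and the sample patterns are hypothetical.

```python
def augmented_matrix(class1, class2):
    """Build the rows of A for a two-class problem (illustrative sketch).

    Each pattern x is augmented with a trailing 1; augmented patterns of
    the second class are multiplied by -1, so that a separating weight
    vector w satisfies a_i . w >= 0 for every row a_i.
    """
    rows = [list(x) + [1.0] for x in class1]
    rows += [[-v for v in x] + [-1.0] for x in class2]
    return rows

# Two made-up two-dimensional patterns, one per class.
A = augmented_matrix([(1, 2)], [(3, 4)])
```

With this convention the last component of every row is $+1$ for the first class and $-1$ for the second, which is the condition $a_{i,n+1} = \pm 1$ stated above.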

Two pattern classes are said to be linearly separable if there
exists a $w$ with the property that $A w \ge 0$. (As indicated below,
a $w$ that satisfies (2.1) can be displaced so that it also satisfies
the strict inequalities $A w > 0$.) There exist a number of
well-known algorithms for finding a solution weight vector when
the classes are separable [1]. In the inseparable case, we are
interested in finding a weight vector that is optimal in the
sense that it satisfies the largest possible number of row
inequalities in (2.1), and thus minimizes the number of patterns
that are misclassified.

Each row vector $a_i$ of $A$ determines a hyperplane in $R^{n+1}$:

    $H_i = \{ w \mid a_i \cdot w = 0 \}$    (2.2)

for $i = 1, 2, \ldots, m$, and

    $a_i \cdot w = \sum_{j=1}^{n+1} a_{ij} w_j$.    (2.3)

Each hyperplane $H_i$ is an $n$-dimensional subspace of $R^{n+1}$
passing through the origin, and (2.2) also indicates that $H_i$ is
the $n$-dimensional orthogonal complement of $a_i$.

Since $H_i$ bifurcates $R^{n+1}$, there is a quartet of open and
closed half-spaces corresponding to each hyperplane. They are
denoted by

    $S_i^{+} = \{ w \in R^{n+1} \mid a_i \cdot w > 0 \}$
    $\bar{S}_i^{+} = \{ w \in R^{n+1} \mid a_i \cdot w \ge 0 \}$
    $S_i^{-} = \{ w \in R^{n+1} \mid a_i \cdot w < 0 \}$
    $\bar{S}_i^{-} = \{ w \in R^{n+1} \mid a_i \cdot w \le 0 \}$.    (2.4)

It is easily demonstrated that each of these half-spaces is convex,
and that the intersection of an arbitrary collection of convex
sets is itself convex. It is also noted that $H_i$ is the boundary
(or frontier) of each of the half-spaces defined in (2.4).

A convex polyhedral set is defined as the intersection of a
finite number of closed half-spaces. Furthermore, because
they are closed under addition and nonnegative scalar
multiplication, the partitions of $R^{n+1}$ generated by the closed
half-spaces in (2.4) also satisfy the definition of convex cones
[13]-[15]. Therefore, the partitioning of $R^{n+1}$ by
$\{H_i \mid i = 1, 2, \ldots, m\}$ establishes a finite set of convex polyhedral cones.
Each of these cones, generated by a finite number of supporting
hyperplanes, contains the origin, is nonempty, closed, and
unbounded along its principal axis. The boundary of a cone,
formed by sections of its supporting hyperplanes, is called the
frontier of the cone. The intersection of $n$ hyperplanes defines
an edge on the frontier of a cone.

The concepts introduced in the above discussion are illustrated
in Fig. 1. It is noted that a vector $w$ contained in the interior
of a cone would yield strict inequalities, while a vector
contained in an edge would yield a zero inner product with all
the hyperplanes that define that edge. It is shown in [5] that
displacing an edge vector into the interior of a cone without
changing the sense of the strict inequalities is a simple matter
when $A$ satisfies the Haar condition. It is also illustrated in
Fig. 1 that every cone $C$ in $R^{n+1}$ has an image, denoted by $C^*$,
about the origin. If the number of inequalities satisfied by a
vector $w$ in $C$ is less than or equal to $k_m$, where

    $k_m = m/2$ for $m$ even,
    $k_m = (m+1)/2$ for $m$ odd,    (2.5)

then the number of inequalities satisfied by the image of $w$
(i.e., $w^* = -w$) is greater than or equal to $k_m$.

Fig. 1. Illustration of a three-dimensional convex polyhedral cone, its
image, frontier, and edges.
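
The relationship between a vector and its image can be checked numerically with a small sketch. It relies only on the observation that, when no inner product $a_i \cdot w$ is zero, every inequality violated by $w$ is satisfied by $-w$ and vice versa; the helper names and the sample matrix are hypothetical.

```python
def dot(a, w):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, w))

def satisfied(A, w):
    """Number of rows a_i of A with a_i . w >= 0."""
    return sum(1 for row in A if dot(row, w) >= 0)

# Made-up 3 x 2 example (m = 3, n + 1 = 2); w lies in the interior of a cone.
A = [[1, 0], [0, 1], [-1, -1]]
w = [1, 2]
image = [-x for x in w]

counts = (satisfied(A, w), satisfied(A, image))
```

Here $w$ satisfies two of the three inequalities and its image satisfies the remaining one, so the two counts add up to $m$.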

III.  Development  of  the  Algorithm 

In the following discussion we make extensive use of the
index set $I(w)$ of a vector $w$ in $R^{n+1}$, which is defined as the set
of integer values between 1 and $m$ such that $i$ is in $I(w)$ if
$a_i \cdot w < 0$, where $a_i$ is the $i$th row vector of $A$. The error of $w$,
denoted by err($w$), is defined as the cardinality of $I(w)$; in
other words, the number of values of $i$ for which $a_i \cdot w < 0$. In
order to simplify the notation, and since we are interested
only in optimal solutions, it will be assumed throughout that
a vector $w$ will be replaced by $-w$ if err($-w$) < err($w$). We
begin the development with the following lemmas.
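
The definitions of the index set and of the error translate directly into code. The sketch below is illustrative only: the function names are hypothetical, row indices are reported starting from 1 to match the text, and `normalize_sign` implements the convention that $w$ is replaced by $-w$ when that lowers the error.

```python
def dot(a, w):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, w))

def index_set(A, w):
    """I(w): the 1-based indices i for which a_i . w < 0."""
    return {i + 1 for i, row in enumerate(A) if dot(row, w) < 0}

def err(A, w):
    """err(w): the number of inequalities of A w >= 0 violated by w."""
    return len(index_set(A, w))

def normalize_sign(A, w):
    """Return -w if it has strictly lower error than w, else w."""
    neg = [-x for x in w]
    return neg if err(A, neg) < err(A, w) else list(w)

# Made-up inconsistent system with m = 3 rows in R^2.
A = [[1, 0], [0, 1], [-1, -1]]
```

For this matrix, $w = (1, 1)$ violates only the third inequality, while $w = (-1, -1)$ violates the first two and would therefore be replaced by its negative.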

A.  Two  Basic  Lemmas 

Lemma 1: Let $w$ be a nonzero vector in $R^{n+1}$. Then either
$w$ is an optimal solution of (2.1) or one of the hyperplanes $H_i$,
$i \in I(w)$, contains an optimal solution of (2.1).

Proof: Let $z$ be any optimal solution of (2.1) and let $L$ be
a straight line segment extending from $w$ to $z$. We will show
that either $w$ is an optimal solution of (2.1) or $L \cap H_i$ is an
optimal solution for some $i \in I(w)$, which is sufficient to prove
the lemma.

If $w = cz$ for some $c < 0$ then $-w$ is an optimal solution and
we replace $w$ by $-w$. If $w \ne cz$ for $c < 0$, then $L$ does not pass
through the origin. In this case the set of optimal solutions lying
on $L$ consists of a series of one or more subsegments of $L$,
on each of which the same number of inequalities of (2.1) is
satisfied by each vector on that subsegment. Consider the
subsegment containing $z$. One endpoint of this subsegment is $z$.
Let the other endpoint be denoted by $v$, which is not 0 since
$L$ does not pass through the origin and is optimal because it is
on the subsegment containing $z$. If $w = v$, then $w$ is an optimal
solution. Since $v$ is an optimal solution and it is also the
endpoint of an optimal subsegment it follows that, if $w \ne v$, then
for some $i$, $v \in L \cap H_i$ and $a_i \cdot w < 0$; that is, $w$ is on the other
side of the hyperplane defining the end of the optimal
subsegment. Since $a_i \cdot w < 0$, then we have $i \in I(w)$. This concludes
the proof. □

In the following discussion it will be useful to consider the
notion of a minimum-error solution relative to a subspace $S$ of
$R^{n+1}$. By this we mean a nonzero vector $v$ contained in $S$ and
with the property err($v$) $\le$ err($w$) for all nonzero $w \in S$. It is
noted that a minimum-error solution of (2.1) is a minimum-error
solution relative to $R^{n+1}$.

Lemma 2: Let $w$ be a nonzero vector in $S$, a subspace of
$R^{n+1}$. Then either $w$ is an optimal solution of (2.1) relative to
$S$, or at least one of the subspaces $S \cap H_i$, $i \in I(w)$, contains an
optimal solution of (2.1) relative to $S$.

Proof: The proof is analogous to that of Lemma 1. We
let $x$ be any optimal solution relative to $S$ and $L$ the straight
line segment extending from $x$ to $w$. Then $L$ is contained in $S$
and the proof proceeds as before, but with the words “optimal
solution” replaced by “optimal solution relative to $S$.” □

Given a nonzero vector $w$ in $R^{n+1}$ and a hyperplane $H_i$,
$i \in I(w)$, if $x$ is an optimal solution of (2.1) that lies in $H_i$, then
$x$ is also optimal relative to $H_i$. Hence, Lemma 1 implies that
if a set of vectors $B$ contains $w$ and at least one relative optimal
vector for each hyperplane $H_i$, $i \in I(w)$, then $B$ contains at
least an optimal vector relative to $R^{n+1}$. Thus, the search for
an optimal vector can be reduced to a search for relative optimal
vectors for each of the subspaces $H_i$, $i \in I(w)$. This search
for relative optimal vectors will be guided by the concepts
established in Lemma 2.

B.  Search  Trees 

The search for an optimal vector may be expressed in terms
of a tree diagram. In order to illustrate this, assume that
$n + 1 = 4$ and that we begin the search with a vector $w(1)$ lying in
the intersection of hyperplanes $H_1$, $H_2$, $H_3$; that is,
$w(1) \in H_1 \cap H_2 \cap H_3$. If, upon performing the products $a_i \cdot w(1)$,
$i = 1, 2, \ldots, m$, we find that $w(1)$ lies on the negative side of
hyperplanes $H_4$, $H_6$, $H_8$, and $H_9$, then $I[w(1)] = \{4, 6, 8, 9\}$.
This information is summarized in the following tree diagram,

            (1,2,3)
          /   |   |   \
         4    6   8    9

where (1,2,3) specifies the hyperplanes determining the starting
edge and each labeled branch represents a subspace (hyperplane)
to be searched for a relative optimal solution. Once relative
optimal solutions are found for $H_4$, $H_6$, $H_8$, and $H_9$,
Lemma 2 guarantees that either $w(1)$, or at least one of these
relative optimal solutions, is an optimal solution of (2.1).

In order to search $H_4$ we apply Lemma 2, which requires
that we find a vector lying on the subspace $S = H_4$. This can
easily be achieved by exchanging $H_1$ for $H_4$ so that the vector
will lie in $H_4 \cap H_2 \cap H_3$. Let us denote this vector by $w(2)$
and assume, for example, that $I[w(2)] = \{11, 15\}$. Then,
Lemma 2 indicates that the search for an optimal solution relative
to $H_4$ may be reduced to that of searching the subspaces
$H_4 \cap H_{11}$ and $H_4 \cap H_{15}$. The status of the search is summarized
in Fig. 2. It is noted that the dimensionality of the subspaces
to be searched has been reduced by one; in other words,
we started searching $H_4$ (an $n$-dimensional subspace) and the
problem now is one of searching the subspaces $H_4 \cap H_{11}$ and
$H_4 \cap H_{15}$, which are $(n-1)$-dimensional.

Fig. 2. A simple search tree after computation of two edge vectors.

In order to find an edge on $H_4 \cap H_{11}$, we can replace $H_2$ to
obtain $w(3) \in H_4 \cap H_{11} \cap H_3$. Suppose that $I[w(3)] = \{6, 12\}$.
According to Lemma 2, the search for an optimal solution
relative to $H_4 \cap H_{11}$ is reduced to that of searching the
subspaces $H_4 \cap H_{11} \cap H_6$ and $H_4 \cap H_{11} \cap H_{12}$. In our
example, these are one-dimensional subspaces since $n + 1 = 4$.
Thus, the problem at this point is simply that of finding a vector
in $H_4 \cap H_{11} \cap H_6$ and a vector in $H_4 \cap H_{11} \cap H_{12}$. Let us
denote these vectors by $v_1$ and $v_2$.

In order to continue the search, we find $w(4) \in H_4 \cap H_{15} \cap H_3$.
Suppose that $I[w(4)] = \{5, 7, 10\}$. Following an argument
identical to the one just given for $w(3)$, we would find
the problem reduced to that of finding three vectors lying in
$H_4 \cap H_{15} \cap H_5$, $H_4 \cap H_{15} \cap H_7$, and $H_4 \cap H_{15} \cap H_{10}$,
respectively. Let us denote these vectors by $v_3$, $v_4$, and $v_5$. The
initial problem of finding an optimal solution relative to $H_4$
has now been reduced to that of selecting from among the vectors
$v_1$ through $v_5$ the one with the lowest error. The search
up to this point is summarized in Fig. 3. It is noted that this
completes the examination of subspace $H_4$ and that no other
vectors contained in this subspace can yield a lower error than
the best vector in the set $v_1$ through $v_5$. Thus, at this point in
the search, an optimal solution relative to $H_4$ has been found.

In order to continue the search, we would next consider
another hyperplane from the set $\{H_6, H_8, H_9\}$ and repeat the
procedure discussed above for obtaining a relative optimal
solution. When all these hyperplanes have been considered, either
$w(1)$ or at least one of the optimal solutions relative to $H_4$,
$H_6$, $H_8$, or $H_9$, would be an optimal solution to (2.1). It is
noted that the number of levels traversed in the tree in order
to examine each hyperplane for a relative optimal solution is
$n + 1$. In the following discussion we formally define the
concept of a search tree and prove (in Theorem 1) that a search
tree leads to an optimal solution of (2.1). Procedures for
reducing the size of a search tree are discussed in Sections III-C
and III-D.

Fig. 3. Search tree after subspace $H_4$ has been examined.
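
The recursive structure of the search illustrated above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the algorithm as implemented by the authors: it omits the tree-reduction procedures of Sections III-C and III-D, uses 0-based row indices, completes each edge with the lowest-numbered unused hyperplanes (an arbitrary choice), computes edge vectors by cofactor expansion in exact rational arithmetic, and assumes that $A$ satisfies the Haar condition. All function names are hypothetical.

```python
from fractions import Fraction

def det(M):
    """Determinant by Gaussian elimination over exact fractions."""
    M = [[Fraction(x) for x in row] for row in M]
    n, d = len(M), Fraction(1)
    for c in range(n):
        p = next((r for r in range(c, n) if M[r][c] != 0), None)
        if p is None:
            return Fraction(0)
        if p != c:
            M[c], M[p] = M[p], M[c]
            d = -d
        d *= M[c][c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n):
                M[r][k] -= f * M[c][k]
    return d

def edge_vector(rows):
    """A vector orthogonal to n given rows of length n + 1 (cofactors)."""
    n1 = len(rows[0])
    return [(-1) ** j * det([[r[k] for k in range(n1) if k != j] for r in rows])
            for j in range(n1)]

def index_set(A, w):
    """0-based indices i with a_i . w < 0."""
    return [i for i, row in enumerate(A)
            if sum(a * x for a, x in zip(row, w)) < 0]

def best_sign(A, w):
    """Replace w by -w when that lowers the error."""
    neg = [-x for x in w]
    return neg if len(index_set(A, neg)) < len(index_set(A, w)) else w

def search(A, fixed=()):
    """Best vector found in the intersection of the hyperplanes in `fixed`."""
    n = len(A[0]) - 1
    extra = [i for i in range(len(A)) if i not in fixed][: n - len(fixed)]
    w = best_sign(A, edge_vector([A[i] for i in list(fixed) + extra]))
    best = w
    if len(fixed) < n:                       # subspace is not yet a line
        for i in index_set(A, w):            # branch on violated rows only
            cand = search(A, tuple(fixed) + (i,))
            if len(index_set(A, cand)) < len(index_set(A, best)):
                best = cand
    return best

# Made-up examples in R^2: one inconsistent system, one consistent system.
A_incons = [[1, 0], [0, 1], [-1, -1]]
A_cons = [[1, 0], [1, 1], [2, 1]]
w1, w2 = search(A_incons), search(A_cons)
```

Without bounds the tree can grow very quickly with $m$ and $n$, which is why the reductions discussed in the later sections matter in practice.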

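Let $w(1), w(2), \ldots, w(p)$ be nonzero vectors in $R^{n+1}$, and
let $I_k = I(w(k))$, $k = 1, 2, \ldots, p$. Assume that for $k = 2, 3,
\ldots, p$, we have

    $w(k) \in H_{l_1} \cap H_{l_2} \cap \cdots \cap H_{l_{k-1}}$    (3.1)

for $l_1 \in I_1$, $l_2 \in I_2$, $\ldots$, $l_{k-1} \in I_{k-1}$. We then define a search
tree as the vectors $w(1), w(2), \ldots, w(p)$ together with the
following subspaces:

    $H_{i_1}$,  $i_1 \in I_1$,  $(i_1 \ne l_1)$,
    $H_{l_1} \cap H_{i_2}$,  $i_2 \in I_2$,  $(i_2 \ne l_2)$,
    $\vdots$
    $H_{l_1} \cap \cdots \cap H_{l_{p-2}} \cap H_{i_{p-1}}$,  $i_{p-1} \in I_{p-1}$,  $(i_{p-1} \ne l_{p-1})$,
    $H_{l_1} \cap \cdots \cap H_{l_{p-1}} \cap H_{i_p}$,  $i_p \in I_p$.    (3.2)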

The above set of vectors and subspaces can be represented
schematically by the diagram shown in Fig. 4. We may think
of $w(1), w(2), \ldots, w(p)$ as “starting vectors” for examining a
series of subspaces for relative optimal solutions. Thus, $w(1)$
is the starting vector for $R^{n+1}$, $w(2)$ the starting vector for
$H_{l_1}$, $w(3)$ the starting vector for $H_{l_1} \cap H_{l_2}$, and so on. These
vectors play the role of $w$ in Lemmas 1 and 2. The top-level
branches in Fig. 4 represent the subspaces $H_i$, $i \in I_1$, the
second-level branches the subspaces $H_{l_1} \cap H_i$, $i \in I_2$, and so
on. According to Lemmas 1 and 2, these are the subspaces
that must be examined for their relative optimal solutions in
order to obtain relative optimal solutions for $R^{n+1}$, $H_{l_1}$,
$H_{l_1} \cap H_{l_2}$, and so on. For clarity we have not labeled all the
branches in the diagram of Fig. 4, but have instead indicated
the index sets from which these labels would come.

The  following  theorem  generalizes  Lemma  2  and  establishes 
that  a  search  tree  leads  to  an  optimal  solution  of  (2.1). 

Theorem 1: Let $w(1), w(2), \ldots, w(p)$ satisfy (3.1) and let
$T_p$ be the union of all the subspaces in (3.2). Then, there is at
least one optimal vector relative to $R^{n+1}$ in the set

    $T_p \cup \{ w(k) \mid k = 1, 2, \ldots, p \}$    (3.3)

for $p = 1, 2, \ldots, n$.

Fig. 4. Schematic diagram of a general search tree.

Proof:  Consider  the  following  subspaces: 

Gp-i=u{tflln---n/flp_l 

n/f,p_,|/p_,  €/p_,,/p.|  ^/p.,} 

QP  -  u  {//,,  n  •  •  •  n  //lp . ,  n  h,p \ip  <=  /p}  (3.4) 

where,  in  the  last  expression,  inclusion  of  the  “''"means  that 
ip  is  allowed  to  equal  lp.  It  then  follows  from  the  statement 
of  the  theorem  that 

rp=GiUG,U---UGp.  (33) 

For $p = 1$, $T_1 = \bigcup \{\, H_{i_1} \mid i_1 \in I_1 \,\}$ and the theorem reduces to Lemma 1. For $p > 1$ we use Lemma 2 and induction on $p$. Assume that the theorem is true for $p = q$; we then wish to prove that it is also true for $p = q + 1$. In other words, we assume that

$$T_q \cup \{\, w(1), w(2), \ldots, w(q) \,\} \qquad (3.6)$$

contains at least one optimal solution and we wish to prove that this is also true for

$$T_{q+1} \cup \{\, w(1), w(2), \ldots, w(q), w(q+1) \,\}. \qquad (3.7)$$

From the above use of the symbol "^" we have that

$$\hat{Q}_q = Q_q \cup \{\, H_{l_1} \cap H_{l_2} \cap \cdots \cap H_{l_{q-1}} \cap H_{i_q} \mid i_q = l_q \,\} = Q_q \cup S \qquad (3.8)$$




CLARK  AND  GONZALEZ:  SOLUTION  OF  LINEAR  INEQUALITIES  647 




Fig. 5. Search tree diagram used in the proof of Theorem 1. The circled branches in each level represent the subspaces (hyperplanes) used in forming the set $Q$ at that level.

where

$$S = H_{l_1} \cap H_{l_2} \cap \cdots \cap H_{l_{q-1}} \cap H_{l_q}. \qquad (3.9)$$

The situation is shown in Fig. 5, in which $Q_1, Q_2, \ldots, \hat{Q}_q$ are the subspaces formed by the union of hyperplanes represented by the circled branches. To form $\hat{Q}_q$ we simply include hyperplane $H_{l_q}$ in the union of the hyperplanes forming $Q_q$.

It is noted in Fig. 5 that $w(q+1) \in H_{l_1} \cap H_{l_2} \cap \cdots \cap H_{l_q}$. In other words, $w(q+1) \in S$. It then follows from Lemma 2 that either $w(q+1)$ is an optimal solution relative to $S$ or there is at least one optimal solution relative to $S$ in the subspaces $S \cap H_{i_{q+1}}$, $i_{q+1} \in I_{q+1}$. Representing these subspaces by $\bigcup \{\, S \cap H_{i_{q+1}} \mid i_{q+1} \in I_{q+1} \,\}$, we note that

$$\bigcup \{\, S \cap H_{i_{q+1}} \mid i_{q+1} \in I_{q+1} \,\} = \hat{Q}_{q+1}. \qquad (3.10)$$

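Since

$$T_q = Q_1 \cup Q_2 \cup \cdots \cup \hat{Q}_q,$$

$$T_{q+1} = Q_1 \cup Q_2 \cup \cdots \cup Q_q \cup \hat{Q}_{q+1},$$

and $Q_q = \hat{Q}_q - S$ (where "$-$" represents set subtraction) it follows that

$$T_{q+1} = (T_q - S) \cup \hat{Q}_{q+1} \qquad (3.11)$$

so that

$$T_{q+1} \cup \{\, w(1), w(2), \ldots, w(q), w(q+1) \,\} = [\,(T_q - S) \cup \{\, w(1), w(2), \ldots, w(q) \,\}\,] \cup [\,\hat{Q}_{q+1} \cup w(q+1)\,]. \qquad (3.12)$$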


In order to finish the proof, we only need to show that either of the subspaces $[(T_q - S) \cup \{ w(1), w(2), \ldots, w(q) \}]$ or $[\hat{Q}_{q+1} \cup w(q+1)]$ contains an optimal solution relative to $R^{n+1}$. Letting $z$ represent an optimal solution relative to $S$, we know from Lemma 2 that either $z = w(q+1)$ or $z \in \hat{Q}_{q+1}$; that is, $\hat{Q}_{q+1} \cup w(q+1)$ contains an optimal solution relative to $S$. If $z$ is also an optimal solution relative to $R^{n+1}$ we are finished with the proof. If this is not the case, then $S$ does not contain an optimal solution relative to $R^{n+1}$ and it may be deleted from further consideration in the proof, leaving the subspace $T_q \cup \{ w(1), w(2), \ldots, w(q) \}$. However, we know from the induction hypothesis that this subspace contains at least one optimal solution relative to $R^{n+1}$. This concludes the proof. □

C. Reduced Search Trees

The number of branches that are investigated in a search tree can be reduced by keeping a record of the subspaces that have already been searched. This will eliminate computation of the same information more than once and thus reduce the time required to complete the search for an optimal solution. In this section we consider techniques for reducing search trees and prove that a reduced search tree will lead to an optimal solution of (2.1).

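Let $w(1), w(2), \ldots, w(p)$ be nonzero vectors in $R^{n+1}$, and let $I'_1, I'_2, \ldots, I'_p$ and $J_1, J_2, \ldots, J_p$ be sets of integers between 1 and $m$ satisfying the following conditions:

$$I'_i \subset I_i, \quad i = 1, 2, \ldots, p \qquad (3.13)$$

$$I'_i \cap J_j = \emptyset, \quad i = 1, 2, \ldots, p; \quad j = 1, 2, \ldots, i \qquad (3.14)$$

$$I_1 \subset (I'_1 \cup J_1)$$

$$I_2 \subset (I'_2 \cup J_1 \cup J_2)$$

$$\vdots$$

$$I_p \subset (I'_p \cup J_1 \cup \cdots \cup J_p). \qquad (3.15)$$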

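A reduced search tree is defined as a set of vectors $w(k)$, $k = 1, 2, \ldots, p$, satisfying (3.1) for some values $l_1 \in I'_1, l_2 \in I'_2, \ldots, l_{p-1} \in I'_{p-1}$, together with subspaces of the form

$$H_{i_1}, \quad i_1 \in I'_1 \cup J_1, \quad i_1 \neq l_1$$

$$H_{l_1} \cap H_{i_2}, \quad i_2 \in I'_2 \cup J_2, \quad i_2 \neq l_2$$

$$\vdots$$

$$H_{l_1} \cap \cdots \cap H_{l_{p-2}} \cap H_{i_{p-1}}, \quad i_{p-1} \in I'_{p-1} \cup J_{p-1}, \quad i_{p-1} \neq l_{p-1}$$

$$H_{l_1} \cap \cdots \cap H_{l_{p-1}} \cap H_{i_p}, \quad i_p \in I'_p \cup J_p \qquad (3.16)$$

where $I'_k$ and $J_k$, $k = 1, 2, \ldots, p$, satisfy (3.13)-(3.15).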

The diagram of the search tree just defined is shown in Fig. 6. The interpretation given to the sets $I'_k$ and $J_k$ is that elements of $J_k$ indicate subspaces (shown as dashed branches) already examined for a relative optimal solution, while elements of $I'_k$ indicate subspaces yet to be examined, or in the process of being examined. These subspaces are denoted by solid branches in Fig. 6. Condition (3.13) indicates that indexes corresponding to subspaces to be examined are elements of a valid index set. Condition (3.14) indicates that we need not examine any subspace of a subspace already examined. Condition (3.15) is a requirement that we examine each of the subspaces corresponding to the index set $I_i$ that have not already been examined. It is noted that the set $I'_i$ (which gives the indexes of hyperplanes to be considered at the $i$th level) is formed by deleting from $I_i$ any indexes contained in $J_j$, $j = 1, 2, \ldots, i$.
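The bookkeeping implied by conditions (3.13)-(3.15) can be sketched with ordinary set operations. The following is only an illustration under stated assumptions: the function names and the list-of-sets layout (position `i - 1` holds the sets for level `i`) are hypothetical and do not come from the paper.

```python
def remaining_indexes(I_i, J_up_to_i):
    """Form I'_i by deleting from I_i every index found in J_1, ..., J_i."""
    return I_i - set().union(*J_up_to_i)


def is_reduced_tree_consistent(I, I_prime, J):
    """Check conditions (3.13)-(3.15) for per-level index sets.

    I, I_prime and J are lists of Python sets; position i - 1 holds the
    sets for tree level i (an assumed layout, chosen for this sketch).
    """
    searched_so_far = set()
    for I_i, Ip_i, J_i in zip(I, I_prime, J):
        searched_so_far |= J_i                    # J_1 ∪ ... ∪ J_i
        if not Ip_i <= I_i:                       # (3.13)
            return False
        if Ip_i & searched_so_far:                # (3.14)
            return False
        if not I_i <= (Ip_i | searched_so_far):   # (3.15)
            return False
    return True
```

For instance, with the made-up sets `I = [{1, 3, 5}, {1, 4, 6}]` and `J = [{1}, {4}]`, deleting the already-searched indexes gives `I_prime = [{3, 5}, {6}]`, which passes all three checks.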

Corollary 1: Let $w(1), w(2), \ldots, w(p)$ satisfy (3.1) for $l_1 \in I'_1, l_2 \in I'_2, \ldots, l_p \in I'_p$, where $I'_1, I'_2, \ldots, I'_p$ satisfy (3.13), and let $V_p$ be the union of all the sets in (3.16), where $I'_1, I'_2, \ldots, I'_p, J_1, J_2, \ldots, J_p$ satisfy (3.14) and (3.15).

Then, there is at least one optimal solution vector in the set

$$V_p \cup \{\, w(k) \mid k = 1, 2, \ldots, p \,\}. \qquad (3.17)$$

Proof: The proof follows from Theorem 1 by noting that $I_1 \subset (I'_1 \cup J_1), \ldots, I_p \subset (I'_p \cup J_1 \cup \cdots \cup J_p)$ and that $T_p \subset V_p$. □

Another  important  property  of  search  trees  that  leads  to 
further  reductions  in  computation  is  that  any  of  the  starting 
vectors  and  its  index  set  may  be  replaced  by  a  bottom-level 
starting  vector  and  its  index  set.  The  result  will  still  be  a 
search  tree  that  satisfies  Theorem  1  and  its  corollary. 

In order to illustrate how the condition given in (3.14), along with the above replacement procedure, can be used to reduce the search for an optimal solution, consider Fig. 7(a) which shows a tree at some stage of a hypothetical search. The branches taken from left to right and top to bottom represent, respectively,

ha  n  //,,//«  n //*,//«  n  //,,//,  n  //, 

and the dashed branches represent subspaces that have already been investigated for relative optimal vectors. It is noted that the index sets of subspaces to be investigated (i.e., $I'_1$ and $I'_2$) do not contain the indexes of subspaces already searched at, or above, the second level of the tree, as indicated in condition (3.14). Suppose now that a new vector, $u = w(3)$, lying in
$H_4 \cap H_7$ has been computed and its index set is $I_3 = \{1, 8, 10\}$, as shown in Fig. 7(b). In order to continue the search
using this new vector, we have the three possibilities shown in Fig. 7(c)-(e). In Fig. 7(c) we simply leave $w(3)$ in position and delete the branches labeled 1 and 8 because they represent subspaces that were already investigated at a higher level in the tree. In Fig. 7(d), $w(2)$ and its descendants were replaced by $u$ and its descendants, deleting at this level any dashed branches that appear at a higher level (i.e., the branch labeled 1 in this case). It is noted that any dashed branches that do not satisfy this condition (i.e., the branch labeled 3) are retained to show later in the search that they have been investigated at that level of the tree. Finally, Fig. 7(e) shows the entire tree replaced by $u$ and its descendants. At this level only $H_1$ has been investigated so the dashed branch labeled 1 is retained and the branches labeled 8 and 10 are solid, indicating that they still have to be searched for a relative optimal solution. It is noted that all three possibilities in Fig. 7(c)-(e) satisfy the conditions of Theorem 1 and its corollary; thus, any of the three choices to continue the search will lead to an optimal solution. Our goal is to choose the candidate with the most potential for reducing the search. The criterion we will use is to place $u$ at the highest possible level in the tree such that the number of solid branches at that level is less than before. This criterion seeks to reduce the search by trimming off, at the highest possible level, subspaces that would have been investigated in the original tree. We are thus led to the following rule.

Replacement and Deletion Rule: Let $u = w(r)$ with original index set $I(u)$ be a vector computed at the bottom level $r$ of a search tree. For values $q = 1, 2, \ldots, r - 1$, we let

$$I'_q(u) = I(u) - \{ J_1 \cup J_2 \cup \cdots \cup J_q \} \qquad (3.18)$$

and

$$J_q(u) = J_q \qquad (3.19)$$

where "$-$" indicates set subtraction. We then choose the smallest value of $q$, if any, for which cardinality $[I'_q(u)]$ < cardinality $[I'_q]$, and replace $w(q)$ and its index sets by $u$ and the index sets given in (3.18) and (3.19), deleting all descendants of $w(q)$. If no such value of $q$ exists, no replacement takes place and the index sets at the $r$th level are given by

$$I'_r = I(u) - \{ J_1 \cup J_2 \cup \cdots \cup J_{r-1} \} \qquad (3.20)$$

and

$$J_r = \emptyset. \qquad (3.21)$$

It is noted that any indexes of subspaces already investigated at a higher level in the tree are deleted from $I(u)$ to form $I'_q(u)$ and that $J_q(u)$ retains all indexes of subspaces already investigated at level $q$. In (3.21), $J_r = \emptyset$ because $r$ is at the bottom level of the search tree and no subspaces have yet been investigated at that level.
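A minimal sketch of the replacement and deletion rule, assuming the per-level sets are held in Python lists (position `q - 1` for level `q`). The function name, the data layout, and the sample sets are assumptions made for illustration; the sample sets are not read from Fig. 7.

```python
def replacement_and_deletion(I_u, I_prime, J):
    """Decide at which level the new bottom-level vector u should sit.

    I_u     : index set I(u) of u = w(r)
    I_prime : list of sets I'_1, ..., I'_{r-1} (indexes still to be searched)
    J       : list of sets J_1, ..., J_{r-1} (indexes already searched)

    Returns (level, I'_level, J_level) following (3.18)-(3.21).
    """
    r = len(J) + 1
    searched = set()
    for q in range(1, r):
        searched |= J[q - 1]                     # J_1 ∪ ... ∪ J_q
        candidate = I_u - searched               # (3.18)
        if len(candidate) < len(I_prime[q - 1]):
            # Smallest q with fewer solid branches: u replaces w(q).
            return q, candidate, set(J[q - 1])   # (3.19)
    # No level qualified: u stays at level r, per (3.20) and (3.21).
    return r, I_u - searched, set()
```

With the hypothetical inputs `I_u = {1, 8, 10}`, `I_prime = [{2, 4}, {5, 6, 7}]` and `J = [{1}, {3, 8}]`, level 1 is rejected (two solid branches before and after) and level 2 is chosen, leaving one solid branch.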

Returning to the example in Fig. 7, we see that the above rule would choose the configuration shown in Fig. 7(d) to continue the search.

Fig. 7. (a) Search tree. (b) New vector computed in $H_4 \cap H_7$. (c) through (e) The three possibilities involving $u = w(3)$ in the continuation of the search.

D.  Ordering  of  the  Index  Sets 

Each node in a search tree represents a vector established by the intersection of $n$ hyperplanes. As shown in Fig. 8, a vector at the $k$th level in the tree satisfies the requirement

$$w(k) \in H_{l_1} \cap H_{l_2} \cap \cdots \cap H_{l_{k-1}} \cap H_{l_k} \cap H_{l_{k+1}} \cap \cdots \cap H_{l_n} \qquad (3.22)$$

where $1 \le k \le n$ and $l_j \in I'_j$, $j = 1, 2, \ldots, k - 1$. As indicated in Section III-B, we compute a vector $w(k+1)$ at the next level by replacing $H_{l_k}$ by a hyperplane $H_{i_k}$ with $i_k \in I'_k$. In order to stress the dependence of $w(k+1)$ on $i_k$ we will, in this section, represent these vectors by $w(k+1, i_k)$. Then,

$$w(k+1, i_k) \in H_{l_1} \cap \cdots \cap H_{l_{k-1}} \cap H_{i_k} \cap H_{l_{k+1}} \cap \cdots \cap H_{l_n} \qquad (3.23)$$

for some $i_k \in I'_k$. There are as many of these vectors as there are elements in $I'_k$. In order to continue the search we may take any of these, obtain $I_{k+1}$, compute $w(k+2)$, and so on.

Fig. 8. Illustration of a vector $w(k)$ at level $k$ of a search tree and the computation of $w(k+1)$ by replacing $H_{l_k}$ by $H_{i_k}$, $i_k \in I'_k$.

However, since we are seeking a minimum-error solution, it would be advantageous to be able to select the $w(k+1, i_k)$ with the smallest error as the next candidate in the search. The brute-force method of computing all $w(k+1, i_k)$ and choosing the best one would be in general unacceptable in view of the fact that the replacement and deletion rule will in many cases eliminate the consideration of some of these vectors in the first place. In this section we develop a technique for obtaining the errors of all $w(k+1, i_k)$, $i_k \in I'_k$, without actually having to obtain these vectors directly. These error values can then be used to rearrange the elements of $I'_k$ so that they are considered in order of increasing error. As shown below, the method is very economical from a computational point of view.

Fig. 9. Example of subspace $V_k$.

The subspace

$$V_k = H_{l_1} \cap H_{l_2} \cap \cdots \cap H_{l_{k-1}} \cap H_{l_{k+1}} \cap \cdots \cap H_{l_n} \qquad (3.24)$$

is two-dimensional and it contains both $w(k)$, which is known, and $w(k+1, i_k)$, $i_k \in I'_k$, which are unknown. Let $z(k)$ be a vector contained in $V_k$ and orthogonal to $w(k)$. (The vector $z(k)$ may be found, for example, by Gaussian elimination.) It then follows that each $w(k+1, i_k)$ may be expressed as a linear combination of $w(k)$ and $z(k)$.

Fig. 9 illustrates a typical geometrical configuration within the two-dimensional subspace $V_k$. A vector $w(k+1, i_k)$ lies in the intersection (shown as a dashed line) of $H_{i_k}$ with $V_k$, and $w(k)$ lies in the intersection of $H_{l_k}$ with $V_k$. The projection of the normal to $H_{i_k}$ (see Section II) onto the $V_k$ plane is shown with coordinates $(\alpha_1, \alpha_2)$, where $a_{i_k}$ is the row vector of $A$ determined by the value of $i_k$. From simple geometrical considerations we have that

$$(\alpha_1, \alpha_2) = \left( a_{i_k} \cdot w(k) / \|w(k)\|,\ a_{i_k} \cdot z(k) / \|z(k)\| \right) \qquad (3.25)$$

and

$$\frac{\alpha_1}{\alpha_2} = -\cot\theta \qquad (3.26)$$

where $\theta$ is the angle from the $w(k)$ axis counterclockwise to the dashed half-line. Finally, we define the quantity $\gamma(j)$ as

$$\gamma(j) = a_j w(k) / a_j z(k) = -C \cot\theta \qquad (3.27)$$

where $C = \|w(k)\| / \|z(k)\|$.


Based on the foregoing concepts, we define the following ordering rule.

1) Let $D$ be the set of indexes of the hyperplanes determining $w(k)$; that is, from (3.22), $D = \{ l_1, l_2, \ldots, l_n \}$. A vector $w(k+1, i_k)$ lies on the hyperplanes with indexes in this set (except $H_{l_k}$) since $H_{l_1}, H_{l_2}, \ldots, H_{l_{k-1}}, H_{l_{k+1}}, \ldots, H_{l_n}$ define $V_k$.

2) Let $b(k)$ be given by

$$b(k) = \begin{cases} 0 & \text{if } a_{l_k} \cdot z(k) > 0 \\ 1 & \text{if } a_{l_k} \cdot z(k) < 0. \end{cases}$$

3) For each $i_k \in I'_k$, let

$M(i_k)$ = number of hyperplanes $H_j$ for which $\gamma(j) < \gamma(i_k)$, $j \notin I_k \cup D$,
$N(i_k)$ = number of hyperplanes $H_q$ for which $\gamma(q) < \gamma(i_k)$, $q \in I_k$.

4) For each $i_k \in I'_k$, compute the error of $w(k+1, i_k)$, as follows:

$$\mathrm{err}[w(k+1, i_k)] = \mathrm{err}[w(k)] + M(i_k) - (N(i_k) + 1) + b(k).$$

5) Rearrange the elements of $I'_k$ in order of increasing error. Elements with the same error are ordered arbitrarily.

As an illustration of the ordering rule, consider a problem in which $m = 10$, $n = 4$, and suppose that we are at the second level in the search tree with $w(2) \in H_2 \cap H_4 \cap H_8 \cap H_{10}$, $I_2 = \{3, 5, 6\}$, and $I'_2 = \{5, 6\}$. In this case $V_2 = H_2 \cap H_8 \cap H_{10}$, $D = \{2, 4, 8, 10\}$, and suppose that $a_4 \cdot z(2) > 0$. The situation is shown in Fig. 10, where the hyperplanes with indexes in $I'_2$ are circled. Note that $H_2$, $H_8$, and $H_{10}$ are not shown because they define $V_2$. By definition, $w(2)$ lies on the negative side of the hyperplanes with indexes in $I_2$ and on the positive side of all other hyperplanes with indexes not in $D$. Since $a_4 \cdot z(2) > 0$, $z(2)$ lies on the positive side of $H_4 \cap V_2$. The error of $w(2)$ is 3 because $I_2$ contains three elements. We also note that, since $-\cot\theta$ is an increasing function of $\theta$ for $0 < \theta < \pi$, $\gamma(1) < \gamma(5) < \gamma(9) < \gamma(3) < \gamma(6) < \gamma(7)$. Thus, to compute $\mathrm{err}[w(3, 5)]$ we first compute $M(5) = 1$ and $N(5) = 0$. Then, since $b(2) = 0$, it follows from step 4) that $\mathrm{err}[w(3, 5)] = 3$. Similarly, $M(6) = 2$, $N(6) = 2$, and $\mathrm{err}[w(3, 6)] = 2$. Thus, the ordered index set becomes $I'_2 = \{6, 5\}$.
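Steps 3) through 5) of the ordering rule can be sketched as follows and checked against the illustration above. This is a sketch under assumptions: the function name and argument layout are hypothetical, and the numeric values given to `gamma` are placeholders chosen only to respect the stated ordering of the quantities $\gamma(j)$, since the text gives their order but not their values.

```python
def order_index_set(err_wk, gamma, I_k, I_k_prime, D, b_k):
    """Order the elements of I'_k by the predicted error of w(k+1, i_k).

    err_wk    : err[w(k)]
    gamma     : dict mapping hyperplane index j to gamma(j), for j not in D
    I_k       : index set of w(k)
    I_k_prime : indexes still to be examined at level k
    D         : indexes of the hyperplanes determining w(k)
    b_k       : the quantity b(k) from step 2)
    """
    errors = {}
    for i in sorted(I_k_prime):
        # Step 3): count hyperplanes with a smaller gamma, split by membership.
        M = sum(1 for j in gamma
                if j not in I_k and j not in D and gamma[j] < gamma[i])
        N = sum(1 for q in I_k if gamma[q] < gamma[i])
        # Step 4): predicted error without computing w(k+1, i) itself.
        errors[i] = err_wk + M - (N + 1) + b_k
    # Step 5): increasing error; ties keep ascending index order here.
    ordered = sorted(errors, key=errors.get)
    return ordered, errors


# Placeholder values that only encode the ordering stated in the text.
gamma = {1: 0, 5: 1, 9: 2, 3: 3, 6: 4, 7: 5}
ordered, errors = order_index_set(
    err_wk=3, gamma=gamma, I_k={3, 5, 6}, I_k_prime={5, 6},
    D={2, 4, 8, 10}, b_k=0)
```

Here `errors` comes out as `{5: 3, 6: 2}` and `ordered` as `[6, 5]`, matching the values derived in the illustration.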

The key to the above procedure lies in the fact that $-C \cot\theta$ is an increasing function of $\theta$ for $0 < \theta < \pi$ and, thus, can be used to determine the positive and negative side of any hyperplane with respect to a vector contained in the one-dimensional subspace $H_{i_k} \cap V_k$ and oriented in the $0 < \theta < \pi$ direction. For instance, in the example just described, it is noted that $M(5)$ gives the number of hyperplanes for which $\gamma(j) < \gamma(5)$, $j \notin I_k \cup D$; that is, $M(5)$ is the number of hyperplanes whose intersections with $V_2$ are to the right of a vector contained in $H_5 \cap V_2$, excluding hyperplanes with indexes in $I_2$ and $D$. Since these exclusions guarantee that all hyperplanes used in the computation of $M(5)$ have their positive side facing $w(2)$, it follows that the intersection of any of these hyperplanes which is to the right of a vector contained in $H_5 \cap V_2$ will yield an error (negative product) with respect to a vector $w(k+1, i_k)$, $k = 2$, $i_k = 5$, contained in that one-dimensional subspace and oriented in the $0 < \theta < \pi$ direction. In the above example only one hyperplane, $H_1$, satisfies the conditions necessary for use in the computation of $M(5)$, yielding $M(5) = 1$. Similarly, $N(i_k)$ is the number of hyperplanes whose intersections with $V_k$ are to the right of a vector $w(k+1, i_k)$ in $H_{i_k} \cap V_k$, but whose indexes are in $I_k$ and, therefore, have their negative sides facing $w(k)$. Since $\mathrm{err}[w(k)]$ gives the total number of these hyperplanes, it follows that the quantity $\mathrm{err}[w(k)] - N(i_k) - 1$ is the error with respect to $w(k+1, i_k)$ of all hyperplanes with indexes in $I_k$. (The $-1$ is included to reduce $\mathrm{err}[w(k)]$ by one because $H_{i_k}$ contributed to this error since $i_k \in I_k$; however, $H_{i_k}$ contains $w(k+1, i_k)$ and, therefore, does not contribute to $\mathrm{err}[w(k+1, i_k)]$.) In the present example with $k = 2$ and $i_k = 5$ we have that $N(5) = 0$.

At this point all hyperplane intersections with $V_k$, except $H_{l_k} \cap V_k$ ($l_k = 4$), have been taken into account. In order to establish the contribution of $H_{l_k}$ to $\mathrm{err}[w(k+1, i_k)]$, it is only necessary to determine the positive side of $H_{l_k} \cap V_k$ with respect to the half space $0 < \theta < \pi$. This is easily accomplished by using $z(k)$. If $a_{l_k} \cdot z(k) > 0$, the positive side of $H_{l_k} \cap V_k$ faces the half space just mentioned and $H_{l_k}$ does not contribute to $\mathrm{err}[w(k+1, i_k)]$; that is, $b(k) = 0$. Otherwise, the error is increased by one by letting $b(k) = 1$. In the present example $l_k = 4$ and we have that $b(2) = 0$ because it was assumed that $a_4 \cdot z(2) > 0$.

With reference to step 4) of the ordering rule, and based on the foregoing discussion, it is seen that all contributions to $\mathrm{err}[w(k+1, i_k)]$ have been taken into account. The hyperplanes with indexes in $D$ were not considered because, with the exception of $H_{l_k}$, they combine with $H_{i_k}$ to form the edge containing $w(k+1, i_k)$, as shown in (3.23). As indicated in Section II, $w(k+1, i_k)$ can be displaced so that it yields a positive product with all the edge hyperplanes, so they would not contribute to the error of this vector.

E.  Statement  of  the  Algorithm 

The concepts developed in Sections III-A-D lead to the following algorithm for finding an optimal solution of (2.1).

Notation: 

$w^*$    an optimal solution of (2.1) at the termination of the algorithm

$k$    level in the search tree

$E_k$    set of indexes of $n$ hyperplanes used to compute an edge vector at the $k$th level; this vector can be computed by Gaussian elimination or by using the method discussed in [5]

$w(k)$    edge vector computed at level $k$

$I_k$    index set of $w(k)$

$J_k$    index set of subspaces already examined at level $k$

$I'_k$    index set of subspaces to be examined at level $k$

$\emptyset$    empty set

Step 1 (Initialization): Let

a) $w^*$ = arbitrary starting vector¹;

b) $J_j = \emptyset$, $j = 1, 2, \ldots, n - 1$;

c) $E_1 = \{1, 2, \ldots, n\}$; that is, select the first $n$ hyperplanes to compute $w(1)$;

d) $k = 1$.

¹ An alternative is to start with a quasi-optimal vector determined, for example, by a procedure such as Fisher's [3]. We have found, however, that progress toward an optimal solution is at first rapid, thus partially negating any advantage that may be gained by using additional (and more complex) techniques to estimate a "better" starting vector.

Step 2:

a) Compute $w(k)$ using the hyperplanes with indexes in $E_k$.

b) Compute $I_k$.

c) If $\mathrm{err}[w(k)] < \mathrm{err}(w^*)$, let $w^* = w(k)$.

d) If $\mathrm{err}[w(k)] = 0$, go to Step 9.

e) If $k = n + 1$, set $k = n$ and go to Step 8.

Step 3: Apply the replacement and deletion rule, denoting by $q$ the level at which replacement took place. The index sets $I'_q$ and $J_q$ are given by (3.18) and (3.19).

Step 4:

a) Let $p = k$, $k = q$, $I'_k = I'_q$, and $J_k = J_q$.

b) Let $J_j = \emptyset$ for all $j > k$.

c) If $I'_k = \emptyset$, go to Step 7. Otherwise, rotate the elements in $E_p$ starting in the $k$th position so that the element in the $p$th position goes into the $k$th position. The elements to the left of the $k$th position are not disturbed. Replace the elements in $E_k$ by the elements in $E_p$ after rotation.²

Step 5:

a) Rearrange the elements of $I'_k$ in order of increasing error by applying the ordering rule. Let $e_{\min}$ be the minimum error found.

b) Let $j = 1$.

Step 6:

a) If $k = n$ and $e_{\min} > \mathrm{err}(w^*)$, go to Step 8. Otherwise, continue.

b) Let $E_{k+1} = E_k$ and then replace the $k$th element of $E_{k+1}$ by the $j$th element of $I'_k$.

c) Increment $k$ and $j$ by 1.

d) Go to Step 2.

Step 7: If $k = 1$, go to Step 9. Otherwise, set $J_k = \emptyset$ and continue.

Step 8:

a) Decrement $k$ by 1.

b) Transfer the $j$th element of $I'_k$ to $J_k$ to indicate that another subspace has been searched at level $k$.

c) If the $j$th element was the last element of $I'_k$, go to Step 7. Otherwise, go to Step 6.

Step 9: Stop with edge vector $w^*$ as an optimal solution of (2.1). Displace $w^*$ into the interior of the optimal cone by using the procedure described in [5].

IV.  Bounds  on  the  Number  of  Edges 
Investigated  by  the  Algorithm 

An exhaustive search for an optimal solution carried out by obtaining one vector in each cone edge would require the computation of $C_n^m$ such vectors. For algorithms employing the concept of an index set to reduce the search, it has been shown [5] that the ratio $W_{m,n} / C_n^m$ is strictly less than 1, where $W_{m,n}$ is the number of edge vectors computed under worst case conditions at each step in the search. Experimental results [5] indicate that the actual number of vectors computed in a search can be expected to be considerably less than the theoretical upper bound.

² For example, suppose $k = 2$, $p = 4$, and $E_4 = \{e_1, e_2, e_3, e_4, e_5\}$. We rotate the elements starting in the second position so that the element in the fourth position goes into the second position. The elements to the left of the second position are not disturbed. After rotation and letting $E_2 = E_4$, $e_1$ is still in the first position and $e_4$ is in the second. This step updates the hyperplane indexes at level $k$ by taking into account the fact that a vector at level $p$ was brought up to level $k$.

The theoretical upper bound derived in [5] would apply to the algorithm developed in Section III (since it too is based on the use of an index set) in the case when the search is never restarted and none of the index sets are ordered. When use is made of restarting and ordering we would expect the number of vectors that need to be investigated to be significantly reduced, and the results presented in the next section bear this out. Aside from the fact that a theoretical upper bound can be established under worst case conditions, derivation of an upper bound that takes into account restarting and ordering does not appear feasible because the advantages derived from these procedures are data dependent. It is possible, however, to obtain an expression for the lower bound (as a function of the error of an optimal solution) on the number of vectors that must be investigated by the algorithm. This result, given as a corollary of the following theorem, is quite useful because it establishes a guideline for the minimum amount of computational work required to find an optimal solution of (2.1).

Theorem 2: Let $e = \mathrm{err}(w^*)$ be the number of errors incurred by an optimal solution, $w^*$, of (2.1). Suppose that the algorithm commences searching a subspace $H$ of dimension $p \ge 2$ when $r$ subspaces of dimension $\ge p$ have already been searched, where $r \le e$. Then, in order to complete the search of $H$, the algorithm must compute at least $C_{p-2}^{e-r+p-2}$ edge vectors contained in $H$.

Proof: The proof is by induction on $p$. When $p = 2$, the theorem asserts that the algorithm computes at least one edge in $H$ to complete the search of $H$. This assertion is obviously true. Suppose the assertion of the theorem is true for $p = k$; we wish to prove that it is true for $p = k + 1$, where $2 \le p \le n + 1$.

When the algorithm commences searching $H$, it first computes an edge vector $w \in H$ and the set $I(w)$. The cardinality of $I(w)$ is at least $e$, since the minimum error of any edge vector in $R^{n+1}$ is $e$. According to the induction hypothesis, $r$ subspaces of dimension $\ge k + 1$ have already been searched, where $r \le e$.

If $r = e$, it is possible that all of the subspaces of the form $H \cap H_i$, $i \in I(w)$, are subspaces of the $r$ subspaces which have already been searched, in which case the algorithm has already completed the search of $H$, having computed one edge $w$ in $H$. The assertion of the theorem is that the algorithm must compute at least

$$C_{k-1}^{k-1} = 1 \qquad (4.1)$$

edges in $H$; hence, it is a true assertion in this case.

If $r < e$, then the algorithm must search some subspaces of the form $H \cap H_i$. Consider first the case where throughout the search there is no replacement of the initial vector $w$ and index set $I(w)$. Then the algorithm must search at least $e - r$ of the subspaces of the form $H \cap H_i$, $i \in I(w)$, each of which is of dimension $k$. Upon commencing to search the first of these subspaces, there are $r$ subspaces of dimension $\ge k$ already searched. Upon commencing the search of the $q$th of these subspaces, $1 \le q \le e - r$, there are $r + q - 1$ subspaces of dimension $\ge k$ already searched. Therefore, by use of the induction hypothesis, we see that to search the $q$th of the subspaces $H \cap H_i$ requires the computation of at least $C_{k-2}^{e-r-q+k-1}$ edges, and that the search of $H$ requires at least

$$1 + \sum_{q=1}^{e-r} C_{k-2}^{e-r-q+k-1} \qquad (4.2)$$

edge computations. By letting $j = e - r - q + 1$, this quantity can also be expressed as

$$1 + \sum_{j=1}^{e-r} C_{k-2}^{j+k-2}. \qquad (4.3)$$

It is easy to show that (4.3) is equal to

$$C_{k-1}^{e-r+k-1}. \qquad (4.4)$$

Hence the lower bound of Theorem 2 has been verified for $p = k + 1$ in the case where there is no replacement of the initial vector $w$.
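The step from (4.3) to (4.4) is left to the reader in the text. One way to fill it in is sketched below, assuming that $C_b^a$ denotes the binomial coefficient "$a$ choose $b$" and writing $m = e - r$ (a symbol introduced here only for brevity). The leading 1 in (4.3) is the $j = 0$ term of the sum, and the result then follows from Pascal's rule by induction on $m$.

```latex
% Claim: for m >= 0 and k >= 2,
\[
  \sum_{j=0}^{m} C_{k-2}^{\,j+k-2} \;=\; C_{k-1}^{\,m+k-1}.
\]
% Base case m = 0:
\[
  C_{k-2}^{\,k-2} = 1 = C_{k-1}^{\,k-1}.
\]
% Inductive step: assume the claim for m and add the term j = m + 1.
\[
  \sum_{j=0}^{m+1} C_{k-2}^{\,j+k-2}
  = C_{k-1}^{\,m+k-1} + C_{k-2}^{\,m+k-1}
  = C_{k-1}^{\,m+k},
\]
% using Pascal's rule  C_{b}^{a} + C_{b-1}^{a} = C_{b}^{a+1}
% with a = m + k - 1 and b = k - 1.
```

Setting $m = e - r$ gives $1 + \sum_{j=1}^{e-r} C_{k-2}^{j+k-2} = C_{k-1}^{e-r+k-1}$, which is the expression in (4.4).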

In the case where $w$ is replaced, it can be seen that the same lower bound is still valid because the algorithm must still search at least $e - r$ subspaces of the form $H \cap H_i$, and the $q$th subspace searched requires the computation of at least $C_{k-2}^{e-r-q+k-1}$ edges, as before. This concludes the proof. □

Corollary 2: Let $e = \mathrm{err}(w^*)$ be the number of errors incurred by an optimal solution, $w^*$, of (2.1). Then the lower bound on the number of edge vectors computed during execution of the algorithm is given by the binomial coefficient

$$C_{n-1}^{e+n-1} = \frac{(e+n-1)!}{(n-1)!\, e!}. \qquad (4.5)$$

Proof: The proof follows from Theorem 2 with $H = R^{n+1}$, $p = n + 1$, and $r = 0$. □

The lower bound given by (4.5) is shown in Table I for various values of $e$ and $n$. These values are compared in the next section against the number of edge vectors actually computed by the algorithm in a number of examples.
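The bound of Corollary 2 is straightforward to evaluate. The sketch below computes it in closed form and, as a cross-check, through the recursion used in the proof of Theorem 2. The function names are illustrative choices, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def lower_bound(n, e):
    """Closed form of Corollary 2: (e + n - 1)! / ((n - 1)! e!)."""
    return comb(e + n - 1, e)


def lower_bound_by_recursion(p, e, r=0):
    """Bound of Theorem 2 for a subspace of dimension p with r subspaces
    already searched, unrolled as in the proof: one edge for the subspace
    itself plus the bounds for e - r subspaces of dimension p - 1."""
    if p == 2:
        return 1
    return 1 + sum(lower_bound_by_recursion(p - 1, e, r + q - 1)
                   for q in range(1, e - r + 1))
```

For $n = 4$ this gives 4 when $e = 1$ and 35 when $e = 4$, the lower bounds quoted for the inseparable cases in the experiments; the recursion with $p = n + 1$ and $r = 0$ returns the same numbers.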

V.  Experimental  Results 

The algorithm developed in Section III was programmed in Fortran IV and run on an IBM 370/3031. The following results illustrate the performance of the procedure in separable and inseparable situations.

Experiment  1:  The  first  example  is  based  on  the  measure¬ 
ments  performed  by  Fisher  [3]  on  three  types  of  iris  flowers: 
Iris  Versicolor,  Iris  Virginica,  and  Iris  Setosa.  For  each  type  of 
flower,  four  measurements  (petal  length  and  width  and  sepal 
length  and  width)  were  taken  on  SO  specimens.  This  leads  us 
to  consider  three  two-class  discrimination  problems  with  m  = 
100  and  n  =  4.  The  results  summarized  in  Table  II  agree  with 
the  well-known  fact  that  two  of  the  pairs  are  separable,  and 
one  pair  is  inseparable  with  the  optimal  solution  yielding  one 
error.  It  is  of  interest  to  compare  the  lower  bound  given  in 
Table  I  with  the  actual  number  of  edge  vectors  computed  by 
•he  algorithm  in  the  inseparable  case.  In  the  separable  case. 


TABLE I

Lower Bound on the Number of Edges Computed by the Algorithm

  e \ n       2       3       4       5       6       7       8       9      10
    1         2       3       4       5       6       7       8       9      10
    2         3       6      10      15      21      28      36      45      55
    3         4      10      20      35      56      84     120     165     220
    4         5      15      35      70     126     210     330     495     715
    5         6      21      56     126     252     462     792   1,287   2,002
    6         7      28      84     210     462     924   1,716   3,003   5,005
    7         8      36     120     330     792   1,716   3,432   6,435  11,440
    8         9      45     165     495   1,287   3,003   6,435  12,870  24,310
    9        10      55     220     715   2,002   5,005  11,440  24,310  48,620
   10        11      66     286   1,001   3,003   8,008  19,448  43,758  92,378


TABLE II

Pairwise Discrimination of the Iris Data Classes

Classes discriminated:   Versicolor, Virginica
Weight vector:           0.0038  0.0070  -0.0208  -0.0251  1.0000
Error:                   1
Lower bound:             4
No. of edges computed:   49
CPU time:                1.61 sec.

Classes discriminated:   Virginica, Setosa
Weight vector:           -0.0087  0.0131  0.0257  -0.0033  -1.0000
Error:                   0
No. of edges computed:   10
CPU time:                0.34 sec.

Classes discriminated:   Setosa, Versicolor
Weight vector:           0.0378  -0.0118  -0.0164  -0.0355  -1.0000
Error:                   0
No. of edges computed:   19
CPU time:                0.59 sec.

the minimum number of edges that the algorithm must compute
is, of course, 1.

Experiment 2: In this experiment we compared the performance
of the algorithm against the procedure developed by
Warmack and Gonzalez [5], which is also based on the use of
an index set. Two four-dimensional Gaussian classes of 50
patterns each were generated using a program developed by
Bryan and Tebbe [16]. The two pattern populations were
TABLE III

Result of Experiment with Four-Dimensional, Inseparable
Gaussian Data

Weight vector:           -0.4903  -0.2100  -0.38[illegible]  -0.2276  1.0000
Error:                   4
Lower bound:             35
No. of edges computed:   235
CPU time:                8.85 sec.

TABLE IV

Result of Experiment with Six-Dimensional, Separable
Gaussian Data

Weight vector:           -7.1933  -9.6779  1.5035  -2.7608  -4.2783  -3.6190  1.0000
Error:                   0
No. of edges computed:   36
CPU time:                1.67 sec.

TABLE V

Result of Experiment with Two Five-Dimensional Pattern Classes
Separated by the Boundary x1 + x2 + x3 + x4 + x5 - 1.5 = 0

Weight vector:           -0.7255  -0.2162  -0.6915  -0.6420  -0.6328  1.0000
Error:                   0
No. of edges computed:   82
CPU time:                3.80 sec.

generated by specifying an identity covariance matrix for each
class, with mean vectors (0, 0, 0, 0) and (1.5, 1.5, 1.5, 1.5),
respectively. The results are summarized in Table III, which
shows that 235 edges were investigated, requiring 8.85 s of
CPU time. By contrast, the Warmack-Gonzalez algorithm
investigated 15 854 edges, requiring a total of 206 s of CPU time.

In another experiment we generated two groups of six-dimensional
patterns with unity covariance matrices and mean
vectors with components equal to -0.75 and 0.75, respectively.
In this case the classes were linearly separable, and the results
are summarized in Table IV, which shows that 36 edges, requiring
1.67 s, were investigated by our algorithm. By contrast,
the Warmack-Gonzalez algorithm computed 51 826 edges,
requiring approximately 13 min of CPU time.

Experiment 3: In this example we compare our algorithm
against a procedure developed by Miyake [8]. Using the
multivariate Gaussian generator mentioned in the previous example,
two groups of 100 five-dimensional patterns with unity
covariance matrices and means at 0 and (0.6, 0.6, 0.6, 0.6, 0.6)
were generated satisfying the conditions x1 + x2 + x3 + x4 +
x5 - 1.5 > 0 and x1 + x2 + x3 + x4 + x5 - 1.5 < 0,
respectively, which is analogous to the separable data set used in [8].


The results are summarized in Table V, which shows that 82
edges were investigated, requiring 3.80 s of CPU time. Miyake
reported CPU times of 10-25 min (depending on the starting
weight vector) on a data set generated with the same parameters.
He used a FACOM 230-45S, which is approximately four
times slower than the IBM 370/3031. Taking into account the
difference between the two machines, and using the lower 10
min figure to allow for programming and other variations in
implementation between the two experiments, it appears that
the procedure reported in [8] is, conservatively, on the order
of 40 times slower than ours.

VI.  Conclusions 

The computational advantage of the algorithm developed in
Section III over other direct procedures for finding an optimal
solution of (2.1) is based on two principal factors: the replacement
and deletion rule discussed in Section III-C, and the
ordering rule developed in Section III-D. The first rule reduces
the size of the search tree by trimming branches at the highest
possible level and by keeping an account at each level of the
subspaces that have been previously searched at that level.
The ordering rule arranges subspaces in order of increasing error.
This procedure enhances the computational efficiency of
the algorithm by increasing the frequency with which lower-error
edges are encountered during the search. As indicated in
Section III-D, the use of an ordering procedure is made feasible
by the fact that we are able to obtain the error of an edge
vector without actually having to compute that vector.

The lower bound developed in Section IV for the minimum
number of edges that must be computed by the algorithm
provides a useful measure of the minimum amount of computational
work required to find an optimal solution of (2.1). As
in any search procedure, the number of edges investigated
should be expected to grow rapidly as a function of the number
and dimensionality of the patterns, as well as the error
rate. Although no theoretical bound involving these parameters
appears feasible, we have found in practice that the number
of edges actually computed in inseparable cases is typically
on the order of 10 to 30 times the lower bound. In
separable situations, the algorithm has consistently found an
optimal solution after computing a fraction of the number of
edges required by the other direct procedures against which it
has been compared.

Finally, we point out that although all discussions have been
limited to a linear-discriminant function formulation, the concepts
and procedures developed in this paper are also applicable
to nonlinear discriminant functions by the standard preprocessing
technique of using the nonlinear functions to map
the patterns onto another space [1].
expressed in a simple polynomial form. The nth moment is expressible as a
polynomial of order n whose variable depends on the mean vectors and
eigenvalues of the covariance matrices. A closed-form solution is given for
computing the coefficients of the polynomial expressions.

I. Introduction

Pattern  recognition  and  image  processing  techniques  based  on 
the  Mahalanobis  distance  have  found  wide  applicability,  ranging 
from  nuclear  reactor  surveillance  and  automated  analysis  of 
image  texture  data  to  discrimination  problems  in  biomedical 
observations  [1],  [2],  [3]. 

The importance of the Mahalanobis distance classifier lies in
the fact that, under a Gaussian assumption, it is an optimal
discriminant in the Bayes sense [4]. The estimation of the probability
density function (pdf) of the interclass Mahalanobis distance
has been a topic of active interest for a number of years
because of its direct relation to the probability of error of Bayes'
classifier [5]. For Gaussian data with equal covariance matrices,
the solution of this problem is straightforward [6]. When the
covariance matrices are not equal, however, the problem becomes
considerably more complicated, requiring the use of numerical
integration techniques for computing the pdf [7].

In many applications (e.g., cluster seeking, texture analysis,
and measuring spatial stationarity of multivariate data) it is often
of interest to compute the moments of the interclass Mahalanobis
distance without having to estimate its underlying pdf as an
intermediate step. It is shown in this paper that these moments
can be expressed directly in terms of a polynomial whose coefficients
are given by a straightforward closed-form expression. The
relative simplicity of these results has important implications in
terms of implementation in a digital computer.

II.  Background 

Consider two d-dimensional Gaussian vector populations {x}
and {y} with mean vectors and covariance matrices $m_x$, $m_y$, $C_x$,
and $C_y$, respectively. The intraclass Mahalanobis distance¹ between
any member of {x} and $m_x$ is given by the familiar
equation [1]

$$R(x, m_x) = (x - m_x)^T C_x^{-1} (x - m_x) \tag{1}$$

and, similarly,

$$R(y, m_y) = (y - m_y)^T C_y^{-1} (y - m_y), \tag{2}$$

where "T" indicates the transpose.

As  indicated  in  the  previous  section,  (1)  and  (2)  have  been 
applied  extensively  in  pattern  recognition.  In  this  paper,  we  are 
interested  in  characterizing  the  interclass  Mahalanobis  distance 
between  members  of  x  and  the  mean  my,  which  is  given  by 


$$R(x, m_y) = (x - m_y)^T C_y^{-1} (x - m_y) \tag{3}$$

and similarly,

$$R(y, m_x) = (y - m_x)^T C_x^{-1} (y - m_x). \tag{4}$$

For any nonsingular, real transformation matrix $A$ it is easily
shown that if

$$r = Ax \tag{5}$$

and

$$s = Ay, \tag{6}$$

then $r$ and $s$ are Gaussian random variables with mean vectors

$$m_r = A m_x \tag{7}$$

$$m_s = A m_y \tag{8}$$

and covariance matrices

$$C_r = A C_x A^T \tag{9}$$

$$C_s = A C_y A^T. \tag{10}$$

It is also easily shown that

$$R(r, m_s) = R(x, m_y) \tag{11}$$

and

$$R(s, m_r) = R(y, m_x). \tag{12}$$

Furthermore, as described in [6] and [16], the transformation
matrix $A$ can be chosen so that

$$C_r = A C_x A^T = I \tag{13}$$

and

$$C_s = A C_y A^T = D, \tag{14}$$

where $I$ is the identity matrix and $D$ is a diagonal matrix with
elements $\gamma(i)$, $i = 1, 2, \ldots, d$, along the main diagonal. The elements
$\gamma(i)$ are the eigenvalues of $C_x^{-1} C_y$. From (13), it is noted
that the elements of $r$ are uncorrelated which, in view of our
Gaussian assumption, implies statistical independence. The same
holds true for the elements of $s$.

Using (3), (11), and (14), it follows that

$$R(x, m_y) = R(r, m_s) = (r - m_s)^T D^{-1} (r - m_s) = \sum_{i=1}^{d} (r_i - m_{si})^2\, \gamma^{-1}(i), \tag{15}$$


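where $r_i$ and $m_{si}$, $i = 1, 2, \ldots, d$, are the components of vectors $r$
and $m_s$, respectively. Since $r$ is a Gaussian random vector and
$C_r = I$, we have that the variable $z_i = (r_i - m_{si})$ is Gaussian with
mean $(m_{ri} - m_{si})$ and unit variance. It then follows [9] that

$$w_i = z_i^2 = (r_i - m_{si})^2 \tag{16}$$

is a noncentral chi-square variable with density

$$p(w_i) = e^{-\lambda_i} \sum_{k=0}^{\infty} \frac{\lambda_i^k\, w_i^{(1+2k)/2 - 1}\, e^{-w_i/2}}{k!\; 2^{(1+2k)/2}\; \Gamma\!\left(\frac{1+2k}{2}\right)} \tag{17}$$

and moment generating function

$$\phi_{w_i}(t) = e^{-\lambda_i} \sum_{k=0}^{\infty} \frac{\lambda_i^k}{k!} (1 - 2t)^{-(1+2k)/2}, \tag{18}$$

where

$$\lambda_i = \tfrac{1}{2} (m_{ri} - m_{si})^2. \tag{19}$$
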


Since $r_i$, $i = 1, 2, \ldots, d$, are statistically independent, it follows
that the $w_i$ defined in (16) are also statistically independent.

A similar development can be carried out for $R(y, m_x)$:

$$R(y, m_x) = R(s, m_r) = (s - m_r)^T I^{-1} (s - m_r) = \sum_{i=1}^{d} (s_i - m_{ri})^2. \tag{20}$$

The variable $z_i = (s_i - m_{ri})/\sqrt{\gamma(i)}$ is Gaussian with mean
$(m_{si} - m_{ri})/\sqrt{\gamma(i)}$ and unit variance. As above, the variable

$$w_i = z_i^2 = \frac{1}{\gamma(i)} (s_i - m_{ri})^2 \tag{21}$$

has the density and moment generating function given in (17)
and (18), but $\lambda_i$ is now given by

$$\lambda_i = \frac{1}{2\gamma(i)} (m_{si} - m_{ri})^2. \tag{22}$$

¹ This is in reality a squared distance. However, it has become customary to
refer to this measure simply as the Mahalanobis distance.



III. Moments of the Interclass Mahalanobis Distance

It is shown in this section that the moments about zero of $w_i$
can be expressed in terms of $\lambda_i$. Once these moments have been
obtained, they will be used to obtain the moments of the interclass
Mahalanobis distance.

A. Moments about Zero of $w_i$

We begin the development by noting that the nth moment
about zero of $w_i$ is given by

$$\alpha_n(w_i) = E\{w_i^n\} = \phi_{w_i}^{(n)}(0), \tag{23}$$

where $\phi_{w_i}^{(n)}(0)$ is the nth derivative of (18) with respect to $t$,
evaluated at $t = 0$ [10]. Evaluating (23) with the moment generating
function given in (18) leads to the following theorem.

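Theorem 1: Let $\alpha_n(w_i)$ denote the nth moment about zero of
$w_i$. Then

$$\alpha_n(w_i) = \sum_{r=0}^{n} c(n, r)\, \lambda_i^r, \tag{24}$$

where

$$c(n, r) = 2^r \binom{n}{r} \prod_{j=r}^{n-1} (2j + 1) \tag{25}$$

for all $n \ge 1$ and $0 \le r \le n - 1$, and

$$c(n, n) = 2^n \tag{26}$$

for all $n \ge 0$.

Proof: From (18) and (23),

$$\alpha_n(w_i) = \phi_{w_i}^{(n)}(0) \tag{27}$$

$$= e^{-\lambda_i} \sum_{k=0}^{\infty} \frac{\lambda_i^k}{k!} \prod_{s=1}^{n} (2k + 2s - 1) \tag{28}$$

$$= \sum_{j=0}^{\infty} \frac{(-\lambda_i)^j}{j!} \sum_{k=0}^{\infty} \frac{\lambda_i^k}{k!} \prod_{s=1}^{n} (2k + 2s - 1) \tag{29}$$

$$= \sum_{r=0}^{\infty} \frac{\lambda_i^r}{r!} \sum_{m=0}^{r} (-1)^{r-m} \binom{r}{m} \prod_{s=1}^{n} (2m + 2s - 1), \tag{30}$$

where (30) follows from (29) by the standard rule for multiplying
Taylor series.

By a well-known formula from the calculus of finite differences
[15],

$$b(n, r) = \sum_{m=0}^{r} (-1)^{r-m} \binom{r}{m} \prod_{s=1}^{n} (2m + 2s - 1) = \Delta^r \left[ \prod_{s=1}^{n} (2x + 2s - 1) \right]_{x=0}, \tag{31}$$

where $\Delta^0 f(x) = f(x)$, $\Delta^1 f(x) = \Delta f(x) = f(x + 1) - f(x)$, and

$$\Delta^r f(x) = \Delta(\Delta^{r-1} f(x)) \tag{32}$$

$$= \Delta^{r-1} f(x + 1) - \Delta^{r-1} f(x) \tag{33}$$

for $r \ge 2$.

Since $\prod_{s=1}^{n} (2x + 2s - 1)$ is a polynomial in $x$ of degree $n$, it
follows that $b(n, r) = 0$ if $r > n$, and hence from (30) that

$$\alpha_n(w_i) = \sum_{r=0}^{n} c(n, r)\, \lambda_i^r, \tag{34}$$

where

$$c(n, r) = b(n, r)/r!. \tag{35}$$

Now

$$\prod_{s=1}^{n} (2x + 2s - 1) = 2^n \left(x + \tfrac{1}{2}\right)\left(x + \tfrac{3}{2}\right) \cdots \left(x + \tfrac{2n-1}{2}\right) \tag{36}$$

$$= 2^n u(u - 1) \cdots (u - n + 1), \tag{37}$$

where

$$u = x + (2n - 1)/2. \tag{38}$$

It follows easily by induction from (33) that

$$\Delta^r\, 2^n u(u - 1) \cdots (u - n + 1) = 2^n n(n - 1) \cdots (n - r + 1)\, u(u - 1) \cdots (u - n + r + 1) \tag{39}$$

when $r < n$, and that

$$\Delta^n\, 2^n u(u - 1) \cdots (u - n + 1) = 2^n n!. \tag{40}$$

Hence, (31), (37), (38), and (39) yield

$$b(n, r) = 2^r n(n - 1) \cdots (n - r + 1) \prod_{j=r}^{n-1} (2j + 1) \tag{41}$$

if $n \ge 1$ and $0 \le r < n$, and (31), (37), (38), and (40) yield

$$b(n, n) = 2^n n!. \tag{42}$$

Dividing (41) by $r!$ and (42) by $n!$ yields (25) and (26), as desired. □

In particular, with $c(0, 0) = 1$,

$$c(n, 0) = \prod_{j=0}^{n-1} (2j + 1) = (2n - 1)\, c(n - 1, 0) \tag{43}$$

for $n \ge 1$, and

$$c(n, r) = \frac{2(n - r + 1)}{r(2r - 1)}\, c(n, r - 1) \tag{44}$$

for $n \ge 1$ and $1 \le r \le n$. The recurrence relations (43) and (44)
enable one to generate the triangular array of numbers $c(n, r)$
with ease. Hence one may easily calculate the polynomials $\alpha_n(w_i)$,
which are listed below for $0 \le n \le 5$:

$$\alpha_0(w_i) = 1$$

$$\alpha_1(w_i) = 1 + 2\lambda_i$$

$$\alpha_2(w_i) = 3 + 12\lambda_i + 4\lambda_i^2$$

$$\alpha_3(w_i) = 15 + 90\lambda_i + 60\lambda_i^2 + 8\lambda_i^3$$

$$\alpha_4(w_i) = 105 + 840\lambda_i + 840\lambda_i^2 + 224\lambda_i^3 + 16\lambda_i^4$$

$$\alpha_5(w_i) = 945 + 9450\lambda_i + 12600\lambda_i^2 + 5040\lambda_i^3 + 720\lambda_i^4 + 32\lambda_i^5.$$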

(45) 
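
The triangular array of coefficients can be generated mechanically. The following is a minimal sketch, assuming that recurrence (43) reads c(n, 0) = (2n - 1) c(n - 1, 0) and that recurrence (44) reads c(n, r) = 2(n - r + 1) / (r(2r - 1)) · c(n, r - 1); both readings reproduce the polynomials listed in (45). The function names are illustrative and do not come from the paper.

```python
def moment_coefficients(n_max):
    """Return c[n][r] for 0 <= r <= n <= n_max using the assumed (43) and (44)."""
    c = [[0] * (n + 1) for n in range(n_max + 1)]
    c[0][0] = 1
    for n in range(1, n_max + 1):
        # (43): first column of the triangular array
        c[n][0] = (2 * n - 1) * c[n - 1][0]
        for r in range(1, n + 1):
            # (44): step along row n; the quotient is always an exact integer
            c[n][r] = 2 * (n - r + 1) * c[n][r - 1] // (r * (2 * r - 1))
    return c


def moment_w(n, lam):
    """nth moment about zero of w_i as the polynomial sum_r c(n, r) * lam**r."""
    row = moment_coefficients(n)[n]
    return sum(coef * lam ** r for r, coef in enumerate(row))


for row in moment_coefficients(5):
    print(row)
```

Row n of the printed array holds the coefficients of the polynomial for the nth moment in increasing powers of the noncentrality parameter.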
B. Moments about Zero of R

From (15) and (16),

$$R(x, m_y) = \sum_{i=1}^{d} w_i\, \gamma^{-1}(i). \tag{46}$$

The nth moment about zero of $R$ is then

$$\alpha_n[R(x, m_y)] = E\{R^n(x, m_y)\} = E\left\{\left[\sum_{i=1}^{d} w_i\, \gamma^{-1}(i)\right]^n\right\}. \tag{47}$$

The coefficients of a sum raised to the nth power are given by the
multinomial theorem [13]; that is,

$$\alpha_n[R(x, m_y)] = E\left\{\sum \frac{n!}{e_1!\, e_2! \cdots e_d!} \prod_{i=1}^{d} \left[w_i\, \gamma^{-1}(i)\right]^{e_i}\right\}, \tag{48}$$

where the summation is taken over all nonnegative values of
$e_1, e_2, \ldots, e_d$ such that $e_1 + e_2 + \cdots + e_d = n$.

In view of the independence of the $w_i$, it follows that

$$\alpha_n[R(x, m_y)] = \sum \frac{n!}{e_1!\, e_2! \cdots e_d!} \prod_{i=1}^{d} \left[\alpha_{e_i}(w_i) / (\gamma(i))^{e_i}\right], \tag{49}$$

where the $\alpha_{e_i}(w_i)$, $i = 1, 2, \ldots, d$, are given by (24), using values
of $\lambda_i$ given by (19).

Since

$$R(y, m_x) = \sum_{i=1}^{d} (s_i - m_{ri})^2 = \sum_{i=1}^{d} w_i\, \gamma(i), \tag{50}$$

where $w_i$ is given by (21), it follows using a similar development
that

$$\alpha_n[R(y, m_x)] = \sum \frac{n!}{e_1!\, e_2! \cdots e_d!} \prod_{i=1}^{d} \left[\alpha_{e_i}(w_i)\, (\gamma(i))^{e_i}\right], \tag{51}$$

where the $\alpha_{e_i}(w_i)$, $i = 1, 2, \ldots, d$, are given by (24) using values
of $\lambda_i$ as given in (22).

IV. Special Cases

In this section we consider special cases involving various
arrangements of mean vectors and covariance matrices of two
pattern populations.

Equal Covariance Matrices

When $C_x = C_y = C$, it follows from (13) and (14) that $C_r = C_s = I$,
so that $\gamma(i) = 1$ for $i = 1, 2, \ldots, d$. The moments are then obtained
via (49) and (51).

Equal Mean Vectors

When $m_x = m_y$, it follows from (7) and (8) that $m_r = m_s$ and,
consequently, $\lambda_i = 0$ in (19) and (22). Then from (24) and (25)

$$\alpha_n(w_i) = c(n, 0) = \prod_{j=0}^{n-1} (2j + 1) \tag{52}$$

for both populations. Substitution of (52) into (49) and (51)
yields

$$\alpha_n[R(x, m_y)] = \sum \frac{n!}{e_1!\, e_2! \cdots e_d!} \prod_{i=1}^{d} \left[c(e_i, 0) / (\gamma(i))^{e_i}\right]$$

and

$$\alpha_n[R(y, m_x)] = \sum \frac{n!}{e_1!\, e_2! \cdots e_d!} \prod_{i=1}^{d} \left[c(e_i, 0)\, (\gamma(i))^{e_i}\right]. \tag{53}$$

Equal Mean Vectors and Covariance Matrices (Intraclass
Mahalanobis Distance)

When $m_x = m_y = m$ and $C_x = C_y = C$, we have only one
population and the problem reduces to computing the moments
of the intraclass Mahalanobis distance. It follows from (52) and
(53) and the fact that each $\gamma(i) = 1$ (see the remarks above on
equal covariance matrices) that

$$\alpha_n[R(x, m_x)] = \sum \frac{n!}{e_1!\, e_2! \cdots e_d!} \prod_{i=1}^{d} \prod_{j=1}^{e_i} (2j - 1) \tag{54}$$

$$= \prod_{j=0}^{n-1} (d + 2j), \tag{55}$$

where the summation in (54) is over all $e_i \ge 0$ such that $e_1 + e_2
+ \cdots + e_d = n$, and (55) follows from (54) by the following
argument. By the extended binomial theorem

$$(1 - 2x)^{-1/2} = \sum_{r=0}^{\infty} \binom{-1/2}{r} (-2x)^r = \sum_{r=0}^{\infty} \left(\prod_{j=1}^{r} (2j - 1)\right) \frac{x^r}{r!}. \tag{56}$$

Raising (56) to the dth power yields

$$(1 - 2x)^{-d/2} = \sum_{n=0}^{\infty} \left[\sum \frac{\prod_{j=1}^{e_1} (2j - 1) \cdots \prod_{j=1}^{e_d} (2j - 1)}{e_1! \cdots e_d!}\right] x^n.$$

On the other hand, the extended binomial theorem yields

$$(1 - 2x)^{-d/2} = \sum_{n=0}^{\infty} \binom{-d/2}{n} (-2x)^n = \sum_{n=0}^{\infty} (d)(d + 2) \cdots (d + 2n - 2) \frac{x^n}{n!}.$$

Equating the coefficients of $x^n$ in the two expansions gives (55).
One observes from (55) that the moment in question depends
only on the order of the moment and the dimension of the
pattern vectors.

V. Conclusion

The expressions given in (24), (25), (26), (49), and (51) lead to a
straightforward algorithm for computing the moments about zero
of the interclass Mahalanobis distance. If desired, the central
moments can be obtained from these results by means of a
well-known transformation [10].

The importance of these results is that they allow direct computation
of the moments without having to resort to the intermediate
step of obtaining the pdf which, as indicated in Section I,
is not a trivial problem in the case of unequal covariance matrices.

The expressions for the moments were considerably simplified
in the special cases discussed in Section IV. In particular, the
intraclass Mahalanobis distance was shown to lead to expressions
which depend only on the order of the moments and the dimension
of the vector populations.
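
The computation summarized in the conclusion can be sketched in a few lines. The code below is a hedged illustration, not the authors' program: it assumes that (25) and (26) give c(n, r) = 2^r C(n, r) ∏_{j=r}^{n-1}(2j + 1), that (49) divides each factor by γ(i)^{e_i}, and that (51) multiplies by γ(i)^{e_i}. All function and argument names are illustrative.

```python
from math import comb, factorial, prod


def c(n, r):
    """Coefficient c(n, r) from the assumed closed forms (25) and (26)."""
    return 2 ** r * comb(n, r) * prod(2 * j + 1 for j in range(r, n))


def alpha_w(n, lam):
    """nth moment about zero of w_i, equation (24)."""
    return sum(c(n, r) * lam ** r for r in range(n + 1))


def compositions(n, d):
    """Yield every d-tuple of nonnegative integers that sums to n."""
    if d == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, d - 1):
            yield (first,) + rest


def alpha_R(n, lams, gammas, inverse=True):
    """nth moment of R(x, m_y) via (49) (inverse=True) or R(y, m_x) via (51)."""
    total = 0.0
    for e in compositions(n, len(lams)):
        term = factorial(n)
        for e_i in e:
            term //= factorial(e_i)  # multinomial coefficient, exact at each step
        for e_i, lam, g in zip(e, lams, gammas):
            power = -e_i if inverse else e_i
            term = term * alpha_w(e_i, lam) * float(g) ** power
        total += term
    return total


# Intraclass check: zero noncentrality and unit eigenvalues, d = 3, n = 3.
print(alpha_R(3, [0, 0, 0], [1, 1, 1]))
```

With zero noncentrality and unit eigenvalues the result should agree with the product d(d + 2)···(d + 2n - 2) of the intraclass special case, which offers a quick sanity check on any implementation.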
Semi-invariants of the Interclass Mahalanobis Distance

R. C. GONZALEZ, FELLOW, IEEE, and C. G. WAGNER

Abstract— A  new  closed-form  expression  for  the  semi-invariants  of  the 
interclass  Mahalanobis  distance  is  derived.  Typically,  in  the  analysis  of  two 
multivariate  Gaussian  populations  with  different  covariance  matrices, 
simultaneous diagonalization of these matrices is required. The semi-invariants
are given directly in terms of the mean vectors and inverse
covariance  matrices  by  the  results  established  in  this  correspondence.  In 
addition,  a  new  iterative  algorithm  is  derived  for  computing  the  moments  of 
the  interclass  Mahalanobis  distance  from  the  semi-invariants. 
I. Introduction

Pattern recognition and image processing techniques based on
the Mahalanobis distance have found wide applicability, ranging
from nuclear reactor surveillance and automated analysis of
image texture data to discrimination problems in biomedical
observations [1], [2], [3].

The importance of the Mahalanobis distance classifier lies in
the fact that under a Gaussian assumption it is an optimal
discriminant in the Bayes sense [4]. The estimation of the probability
density function (pdf) of the interclass Mahalanobis distance
has been a topic of active interest for a number of years
because of its direct relation to the probability of error of the
Bayes classifier [5]. For Gaussian data with equal covariance
matrices, the solution of this problem is straightforward [6].
When the covariance matrices are not equal, however, the problem
becomes considerably more complicated, requiring the use of
numerical integration techniques for computing the pdf [7].

In many applications (e.g., cluster seeking, texture analysis,
and measuring spatial stationarity of multivariate data) it is often
of interest to compute descriptors based on the interclass
Mahalanobis distance. Two such descriptors are the moments
and semi-invariants. In an earlier paper, we established that the
nth moment could be expressed as a polynomial of degree n and
gave a closed-form solution for computing the coefficients [17].
This procedure, however, requires that the covariance matrices of
the two populations be simultaneously diagonalized.

The present work deals with the derivation of a closed-form
expression for the semi-invariants of the Mahalanobis distance.
This expression involves the mean vectors and inverse covariance
matrices directly and does not require the diagonalization of
these matrices. It is well known that the moments and semi-invariants
are related by expressions that, though theoretically
simple, are quite inefficient in terms of computer implementation
[11]. A new, iterative algorithm that is easily implementable in a
digital computer is presented in Section IV for computing the
moments from a given set of semi-invariants.

II.  Background 

Consider two d-dimensional Gaussian vector populations {x}
and {y} with mean vectors and covariance matrices $m_x$, $m_y$,
$C_x$, and $C_y$, respectively. The intraclass Mahalanobis distance¹
between any member of {x} and $m_x$ is given by the familiar
equation [1]

$$R(x, m_x) = (x - m_x)^T C_x^{-1} (x - m_x) \tag{1}$$

and similarly,

$$R(y, m_y) = (y - m_y)^T C_y^{-1} (y - m_y), \tag{2}$$

where T indicates the transpose.

As indicated in the previous section, (1) and (2) have been
applied extensively in pattern recognition. In this work, we are
interested in characterizing the interclass Mahalanobis distance
between members of x and the mean $m_y$, which is given by

$$R(x, m_y) = (x - m_y)^T C_y^{-1} (x - m_y) \tag{3}$$

and similarly,

$$R(y, m_x) = (y - m_x)^T C_x^{-1} (y - m_x). \tag{4}$$

For any nonsingular, real transformation matrix $A$, it is easily
shown that if

$$r = Ax \tag{5}$$

and

$$s = Ay, \tag{6}$$

then $r$ and $s$ are Gaussian random variables with mean vectors

$$m_r = A m_x \tag{7}$$

$$m_s = A m_y \tag{8}$$

and covariance matrices

$$C_r = A C_x A^T \tag{9}$$

$$C_s = A C_y A^T. \tag{10}$$

It is also easily shown that

$$R(r, m_s) = R(x, m_y) \tag{11}$$

and

$$R(s, m_r) = R(y, m_x). \tag{12}$$

Furthermore, as described in [6] and [16], the transformation
matrix $A$ can be chosen so that

$$C_r = A C_x A^T = I \tag{13}$$

and

$$C_s = A C_y A^T = D, \tag{14}$$

where $I$ is the identity matrix and $D$ is a diagonal matrix with
elements $\gamma(i)$, $i = 1, 2, \ldots, d$, along the main diagonal.² The
elements $\gamma(i)$ are the eigenvalues of $C_x^{-1} C_y$. From (13), it is
noted that the elements of $r$ are uncorrelated which, in view of
our Gaussian assumption, implies statistical independence. The
same holds true for the elements of $s$.

Using (3), (11), and (14), it follows that

$$R(x, m_y) = R(r, m_s) = (r - m_s)^T D^{-1} (r - m_s) = \sum_{i=1}^{d} (r_i - m_{si})^2\, \gamma^{-1}(i), \tag{15}$$

where $r_i$ and $m_{si}$, $i = 1, 2, \ldots, d$, are the components of vectors $r$
and $m_s$, respectively. Since $r$ is a Gaussian random vector and
$C_r = I$, we have that the variable $z_i = (r_i - m_{si})$ is Gaussian
with mean $(m_{ri} - m_{si})$ and unit variance. It then follows [9] that

$$w_i = z_i^2 = (r_i - m_{si})^2 \tag{16}$$

is a non-central chi-square variable with density

$$p(w_i) = e^{-\lambda_i} \sum_{k=0}^{\infty} \frac{\lambda_i^k\, w_i^{(1+2k)/2 - 1}\, e^{-w_i/2}}{k!\; 2^{(1+2k)/2}\; \Gamma\!\left(\frac{1+2k}{2}\right)} \tag{17}$$

and moment generating function

$$\phi_{w_i}(t) = e^{-\lambda_i} \sum_{k=0}^{\infty} \frac{\lambda_i^k}{k!} (1 - 2t)^{-(1+2k)/2}, \tag{18}$$

where

$$\lambda_i = \tfrac{1}{2} (m_{ri} - m_{si})^2. \tag{19}$$

Since $r_i$, $i = 1, 2, \ldots, d$, are statistically independent, it follows
that the $w_i$ defined in (16) are also statistically independent.

¹ This is in reality a squared distance. However, it has become customary to
refer to this measure simply as the Mahalanobis distance.

² Although diagonalization is not required in our final results, (13) and (14)
are used in proving the theorem given in the next section.
A similar development can be carried out for $R(y, m_x)$:

$$R(y, m_x) = R(s, m_r) = (s - m_r)^T I^{-1} (s - m_r) = \sum_{i=1}^{d} (s_i - m_{ri})^2. \tag{20}$$

The variable $z_i = (s_i - m_{ri})/\sqrt{\gamma(i)}$ is Gaussian with mean
$(m_{si} - m_{ri})/\sqrt{\gamma(i)}$ and unit variance. As above, the variable

$$w_i = z_i^2 = \frac{1}{\gamma(i)} (s_i - m_{ri})^2 \tag{21}$$

has the density and moment generating function given in (17)
and (18), but $\lambda_i$ is now given by

$$\lambda_i = \frac{1}{2\gamma(i)} (m_{si} - m_{ri})^2. \tag{22}$$

III. Semi-Invariants of the Interclass Mahalanobis Distance

One of the most important properties of the semi-invariants is
that the nth semi-invariant of a sum of independent random
variables is equal to the sum of the nth semi-invariants of the
individual variables [10], [11]. As will be seen in the following
discussion, this property leads to a straightforward procedure for
computing the semi-invariants of the interclass Mahalanobis distance,
using only the original mean vectors and inverse covariance
matrices of the given populations.


A. Semi-invariants of $w_i$

We first obtain the semi-invariants of $w_i$ and then extend the
results to the general case involving $R$. The nth semi-invariant of
a random variable $w_i$ with moment generating function $\phi_{w_i}(t)$ is
defined [14] as

$$\chi_n(w_i) = \frac{d^n}{dt^n} \left[\ln \phi_{w_i}(t)\right]_{t=0}. \tag{23}$$

Use of (18) in this definition leads to the following result involving
$\lambda_i$.

Lemma: The nth semi-invariant of $w_i$ is given by the expression

$$\chi_n(w_i) = 2^{n-1} (n - 1)!\, [1 + 2n\lambda_i]. \tag{24}$$

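Proof: From (18),

$$\phi_{w_i}(t) = e^{-\lambda_i} (1 - 2t)^{-1/2} \sum_{k=0}^{\infty} \frac{1}{k!} \left(\frac{\lambda_i}{1 - 2t}\right)^k.$$

The infinite summation is recognized as the Taylor expansion of
$e^{\lambda_i/(1 - 2t)}$; therefore

$$\phi_{w_i}(t) = e^{-\lambda_i} (1 - 2t)^{-1/2} e^{\lambda_i/(1 - 2t)}. \tag{25}$$

Use of (23) yields

$$\chi_n(w_i) = \frac{d^n}{dt^n} \left[-\lambda_i - \tfrac{1}{2} \ln(1 - 2t) + \lambda_i (1 - 2t)^{-1}\right]_{t=0}$$

$$= \left[2^{n-1} (n - 1)!\, (1 - 2t)^{-n} + 2^n n!\, \lambda_i (1 - 2t)^{-(n+1)}\right]_{t=0}$$

$$= 2^{n-1} (n - 1)! + 2^n n!\, \lambda_i$$

$$= 2^{n-1} (n - 1)!\, [1 + 2n\lambda_i].$$

This concludes the proof.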


As an illustration, the first five semi-invariants of $w_i$ are

$$\chi_1(w_i) = 1 + 2\lambda_i$$

$$\chi_2(w_i) = 2 + 8\lambda_i$$

$$\chi_3(w_i) = 8 + 48\lambda_i$$

$$\chi_4(w_i) = 48 + 384\lambda_i$$

$$\chi_5(w_i) = 384 + 3840\lambda_i. \tag{26}$$
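
The lemma is simple enough to check numerically. The sketch below assumes that (24) reads 2^(n-1) (n - 1)! [1 + 2nλ], and recovers the constant term and the coefficient of λ for the first five orders; the function name is illustrative.

```python
from math import factorial


def semi_invariant_w(n, lam):
    """nth semi-invariant of w_i under the assumed form of (24)."""
    return 2 ** (n - 1) * factorial(n - 1) * (1 + 2 * n * lam)


for n in range(1, 6):
    constant = semi_invariant_w(n, 0)
    slope = semi_invariant_w(n, 1) - constant
    print(f"n = {n}: {constant} + {slope} * lambda")
```

The printed pairs can be compared line by line with the list in (26).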


B. Semi-invariants of R

The semi-invariants of the interclass Mahalanobis distance
$R(x, m_y)$ are given by the following theorem.

Theorem: The nth semi-invariants of $R(x, m_y)$ and $R(y, m_x)$
are given, respectively, by

$$\chi_n[R(x, m_y)] = 2^{n-1} (n - 1)! \left[\mathrm{tr}\{(C_y^{-1} C_x)^n\} + n (m_x - m_y)^T C_y^{-1} (C_x C_y^{-1})^{n-1} (m_x - m_y)\right] \tag{27}$$

and

$$\chi_n[R(y, m_x)] = 2^{n-1} (n - 1)! \left[\mathrm{tr}\{(C_x^{-1} C_y)^n\} + n (m_y - m_x)^T C_x^{-1} (C_y C_x^{-1})^{n-1} (m_y - m_x)\right]. \tag{28}$$

Proof: From (15) and (16),

$$R(x, m_y) = \sum_{i=1}^{d} w_i\, \gamma^{-1}(i).$$



Since the nth semi-invariant of a sum of independent random
variables is the sum of the semi-invariants, and
$\chi_n[w_i \gamma^{-1}(i)] = \chi_n(w_i)\, \gamma^{-n}(i)$ [10], we have

$$\chi_n[R(x, m_y)] = \sum_{i=1}^{d} \chi_n(w_i)\, \gamma^{-n}(i). \tag{29}$$

Then, from the lemma in Section III-A,

$$\chi_n[R(x, m_y)] = \sum_{i=1}^{d} 2^{n-1} (n - 1)!\, [1 + 2n\lambda_i]\, \gamma^{-n}(i) = 2^{n-1} (n - 1)! \sum_{i=1}^{d} \gamma^{-n}(i) + 2^n n! \sum_{i=1}^{d} \lambda_i\, \gamma^{-n}(i). \tag{30}$$

Since $\gamma(i)$, $i = 1, 2, \ldots, d$, are the eigenvalues of $C_x^{-1} C_y$, the
sum of the eigenvalues $\gamma^{-n}(i)$ for $i = 1, 2, \ldots, d$, is the trace of
$(C_y^{-1} C_x)^n$. Modifying (30) by this observation, and using the
definition of $\lambda_i$ given in (19), yields

$$\chi_n[R(x, m_y)] = 2^{n-1} (n - 1)!\, \mathrm{tr}\{(C_y^{-1} C_x)^n\} + 2^{n-1} n! \sum_{i=1}^{d} \frac{(m_{ri} - m_{si})^2}{\gamma^n(i)}. \tag{31}$$

The summation in (31) may be expressed as

$$\sum_{i=1}^{d} \frac{(m_{ri} - m_{si})^2}{\gamma^n(i)} = (m_r - m_s)^T (D^{-1})^n (m_r - m_s). \tag{32}$$

However, from (7) and (8),

$$(m_r - m_s) = A (m_x - m_y) \tag{33}$$

and, from (14),

$$(D^{-1})^n = \left[(A C_y A^T)^{-1}\right]^n = \left[(A^T)^{-1} C_y^{-1} A^{-1}\right]^n.$$

Expansion of the right side of this equation yields

$$(D^{-1})^n = (A^T)^{-1} C_y^{-1} A^{-1} (A^T)^{-1} C_y^{-1} A^{-1} \cdots (A^T)^{-1} C_y^{-1} A^{-1}.$$

But from (13), $A^{-1} (A^T)^{-1} = C_x$, so that

$$(D^{-1})^n = (A^T)^{-1} C_y^{-1} (C_x C_y^{-1})^{n-1} A^{-1}. \tag{34}$$

By substituting (33) and (34) into (32) we obtain

$$\sum_{i=1}^{d} \frac{(m_{ri} - m_{si})^2}{\gamma^n(i)} = (m_x - m_y)^T C_y^{-1} (C_x C_y^{-1})^{n-1} (m_x - m_y). \tag{35}$$

Finally, substituting this result into (31) completes the proof of
(27).

The proof of (28) follows essentially the same line of reasoning,
with the exception that it involves $\gamma^n(i)$ instead of $\gamma^{-n}(i)$, and
the definitions of $w_i$ and $\lambda_i$ are different, as given in (21) and
(22). From (20) and (21) and the distributive properties of the
semi-invariants stated earlier,

$$\chi_n[R(y, m_x)] = \sum_{i=1}^{d} \chi_n(w_i)\, \gamma^n(i), \tag{36}$$

and, from the lemma in Section III-A,

$$\chi_n[R(y, m_x)] = \sum_{i=1}^{d} 2^{n-1} (n - 1)!\, [1 + 2n\lambda_i]\, \gamma^n(i). \tag{37}$$

The validity of the trace portion of (28) follows directly from
the fact that $\gamma^n(i)$ are eigenvalues of $(C_x^{-1} C_y)^n$. To prove the
validity of the second term on the right side of (28), we note that

$$2^n n! \sum_{i=1}^{d} \lambda_i\, \gamma^n(i) = 2^{n-1} n! \sum_{i=1}^{d} (m_{si} - m_{ri})^2\, \gamma^{n-1}(i). \tag{38}$$

The summation term can be expressed as

$$\sum_{i=1}^{d} (m_{si} - m_{ri})^2\, \gamma^{n-1}(i) = (m_y - m_x)^T A^T D^{n-1} A (m_y - m_x). \tag{39}$$

Expansion of the matrix $D^{n-1}$ (see (14)) gives

$$D^{n-1} = A C_y A^T A C_y A^T \cdots A C_y A^T. \tag{40}$$

However, from (13), $A^T A = C_x^{-1}$. Using this fact in (40) and (39)
completes the proof.

It is important to note that the semi-invariants in (27) and (28)
are given directly in terms of the original population parameters
$m_x$, $m_y$, $C_x$, and $C_y$ and, therefore, do not require computation
of the transformation matrix $A$.
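
Because the theorem uses only the original population parameters, it can be evaluated without any diagonalization step. The following is a minimal sketch restricted to two dimensions so that it needs no external library; it assumes (27) has the form 2^(n-1) (n - 1)! [tr{(C_y^{-1} C_x)^n} + n (m_x - m_y)^T C_y^{-1} (C_x C_y^{-1})^{n-1} (m_x - m_y)]. The helper names are illustrative only.

```python
from math import factorial


def matmul(a, b):
    """Product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matpow(m, k):
    """m raised to the nonnegative integer power k (identity for k = 0)."""
    n = len(m)
    out = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        out = matmul(out, m)
    return out


def inv2(m):
    """Inverse of a 2x2 matrix; the sketch is limited to d = 2 for brevity."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def semi_invariant_R(n, mx, my, cx, cy):
    """nth semi-invariant of R(x, m_y) under the assumed form of (27)."""
    cy_inv = inv2(cy)
    trace_term = sum(matpow(matmul(cy_inv, cx), n)[i][i] for i in range(2))
    diff = [p - q for p, q in zip(mx, my)]
    core = matmul(cy_inv, matpow(matmul(cx, cy_inv), n - 1))
    quad = sum(diff[i] * core[i][j] * diff[j]
               for i in range(2) for j in range(2))
    return 2 ** (n - 1) * factorial(n - 1) * (trace_term + n * quad)


identity = [[1.0, 0.0], [0.0, 1.0]]
cy = [[2.0, 0.0], [0.0, 4.0]]
print(semi_invariant_R(2, [1.0, 0.0], [0.0, 0.0], identity, cy))
```

Swapping the roles of the two populations, as in `semi_invariant_R(n, my, mx, cy, cx)`, gives the corresponding value for R(y, m_x) under the same reading of (28).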


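Equal Mean Vectors: When $m_x = m_y$, we have from (27) and
(28) that

$$\chi_n[R(x, m_y)] = 2^{n-1}(n-1)!\,\left[\mathrm{tr}\{(C_y^{-1} C_x)^n\}\right] \tag{42}$$

and

$$\chi_n[R(y, m_x)] = 2^{n-1}(n-1)!\,\left[\mathrm{tr}\{(C_x^{-1} C_y)^n\}\right]. \tag{43}$$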

Equal Mean Vectors and Covariance Matrices (Intraclass Mahalanobis
Distance): In this case we are considering the same
Gaussian population and obtain the semi-invariants by letting
$x = y$, $m_x = m_y$, $C_x = C_y$ in either (27) or (28). This results in
the simple expression

$$\chi_n[R(x, m_x)] = 2^{n-1}(n-1)!\,d \tag{44}$$

which depends only on the order of the semi-invariant and the
dimensionality of the vectors.
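
As a quick sanity check on the intraclass expression, the sketch below evaluates $2^{n-1}(n-1)!\,d$ for small $n$. The function name is hypothetical and introduced only for illustration. The comparison values are the standard cumulants of a chi-square variable with $d$ degrees of freedom (mean $d$, variance $2d$, third cumulant $8d$), which is the distribution one would expect for the intraclass distance of a $d$-dimensional Gaussian population.

```python
from math import factorial


def intraclass_semi_invariant(n, d):
    """n-th semi-invariant of the intraclass distance: 2^(n-1) * (n-1)! * d.

    n is the order (n >= 1) and d is the dimensionality of the vectors.
    """
    return 2 ** (n - 1) * factorial(n - 1) * d


d = 3
first_three = [intraclass_semi_invariant(n, d) for n in (1, 2, 3)]
# Expected to agree with the chi-square cumulants d, 2d, 8d.
print(first_three)
```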

IV.  Obtaining  the  Moments  from  the 
Semi-Invariants 

The $n$th moment $\alpha_n$ of the interclass Mahalanobis distance can
be obtained directly from the semi-invariants $\chi_1, \chi_2, \ldots, \chi_n$ by
using the expression

$$\alpha_n = n! \sum \prod_{r=1}^{n} \frac{1}{a_r!} \left(\frac{\chi_r}{r!}\right)^{a_r} \tag{45}$$

where the sum is taken over values of $a_r$ such that $a_1 + 2a_2
+ \cdots + n a_n = n$ [11]. Equation (45) can be used to obtain the
moments of either $w_i$ or $R$, given the semi-invariants corresponding
to one of these two variables [11]. A similar relationship exists
for computing the semi-invariants given the moments [11].

As  an  illustration  of  the  above  relationship  we  have 

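$$\alpha_1 = \chi_1$$

$$\alpha_2 = \chi_2 + \chi_1^2$$

$$\alpha_3 = \chi_3 + 3\chi_1\chi_2 + \chi_1^3$$

$$\alpha_4 = \chi_4 + 4\chi_1\chi_3 + 3\chi_2^2 + 6\chi_1^2\chi_2 + \chi_1^4.$$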

A direct implementation of (45) in a digital computer is
inherently inefficient, involving, among other things, the
determination of all $n$-tuples of nonnegative integers $(a_1, \ldots, a_n)$
satisfying $a_1 + 2a_2 + \cdots + n a_n = n$. Fortunately, there is a very
efficient recursive algorithm, described below, for computing
these moments.
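
To make the cost of the direct approach concrete, here is a minimal brute-force sketch. It assumes (45) has the standard form $\alpha_n = n! \sum \prod_{r=1}^{n} (\chi_r / r!)^{a_r} / a_r!$, which agrees with the four illustrative expansions above. The helper names are hypothetical. The number of tuples enumerated equals the number of integer partitions of $n$, which is what makes this route unattractive for large $n$.

```python
from fractions import Fraction
from math import factorial


def weighted_tuples(n, k):
    """Yield every tuple (a_1, ..., a_k) of nonnegative integers
    with a_1 + 2*a_2 + ... + k*a_k == n."""
    if k == 0:
        if n == 0:
            yield ()
        return
    for a_k in range(n // k + 1):
        for head in weighted_tuples(n - k * a_k, k - 1):
            yield head + (a_k,)


def moment_direct(n, chi):
    """n-th moment from semi-invariants by direct enumeration.

    chi[r] holds the r-th semi-invariant for r = 1..n; chi[0] is unused.
    Exact rational arithmetic is used to avoid rounding in the check.
    """
    total = Fraction(0)
    for a in weighted_tuples(n, n):
        term = Fraction(1)
        for r, a_r in enumerate(a, start=1):
            term *= (Fraction(chi[r]) / factorial(r)) ** a_r / factorial(a_r)
        total += term
    return factorial(n) * total


# Illustrative input: semi-invariants 2^(n-1) * (n-1)! for n = 1..4.
chi = [None, 1, 2, 8, 48]
moments = [moment_direct(n, chi) for n in range(1, 5)]
```

With these inputs the results can be checked by hand against the expansions of $\alpha_1$ through $\alpha_4$ given above.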

Let

$$Z_n = \frac{\chi_n}{n!\,2^n} \tag{46}$$

where $\chi_n$ is given by (27) or (28), as determined by the relevant
semi-invariant. Under this change of variables (45) becomes

$$\alpha_n = n!\,2^n \sum \prod_{r=1}^{n} \frac{Z_r^{a_r}}{a_r!} \tag{47}$$


C.  Special  Cases 

In  this  section  we  consider  special  cases  involving  various 
arrangements  of  mean  vectors  and  covariance  matrices  of  two 
populations. 

Equal Covariance Matrices: For equal covariance matrices,
$C_x = C_y = C$, it is easily shown that (27) and (28) reduce to

$$\chi_n[R(x, m_y)] = \chi_n[R(y, m_x)] = 2^{n-1}(n-1)!\,\left[d + n\,(m_x - m_y)^T C^{-1} (m_x - m_y)\right]. \tag{41}$$
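
The equal-covariance expression is simple enough to evaluate without any matrix library. The sketch below is illustrative only: the function names are hypothetical, and the small Gaussian-elimination solver stands in for whatever linear-algebra routine would be used in practice. It computes $(m_x - m_y)^T C^{-1} (m_x - m_y)$ by solving $C z = m_x - m_y$ and then applies $2^{n-1}(n-1)!\,[d + n\,(\cdot)]$.

```python
from math import factorial


def solve(C, b):
    """Solve C z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [[float(v) for v in C[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def semi_invariant_equal_cov(n, m_x, m_y, C):
    """n-th semi-invariant for two populations sharing covariance C."""
    d = len(m_x)
    delta = [a - b for a, b in zip(m_x, m_y)]
    z = solve(C, delta)  # z = C^{-1} (m_x - m_y)
    delta_sq = sum(u * v for u, v in zip(delta, z))
    return 2 ** (n - 1) * factorial(n - 1) * (d + n * delta_sq)


# Hypothetical two-dimensional example.
C = [[2.0, 0.0], [0.0, 1.0]]
k1 = semi_invariant_equal_cov(1, [2.0, 1.0], [0.0, 0.0], C)
k2 = semi_invariant_equal_cov(2, [2.0, 1.0], [0.0, 0.0], C)
```

Setting the two mean vectors equal makes the quadratic term vanish, so the result depends only on $n$ and $d$.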


The sum in (47) is taken over all $n$-tuples of nonnegative integers $(a_1, \ldots, a_n)$
satisfying $a_1 + 2a_2 + \cdots + n a_n = n$. If we define a rectangular
array $\beta(n,k)$ by

$$\beta(n,k) = \sum \prod_{r=1}^{k} \frac{Z_r^{a_r}}{a_r!}, \tag{48}$$

taken over all $k$-tuples of nonnegative integers $(a_1, \ldots, a_k)$
satisfying $a_1 + 2a_2 + \cdots + k a_k = n$, then by (47)

$$\alpha_n = n!\,2^n \beta(n,n), \quad n \ge 1, \tag{49}$$

so that the quantities $\alpha_n$ may easily be computed from the
elements lying just below the main diagonal (the main diagonal
elements of $\beta(n,k)$ are $\beta(n-1,n)$, since $n \ge 0$ and $k \ge 1$) of
the array $\beta(n,k)$.

It is clear from (48) that

$$\beta(0,k) = 1, \quad k \ge 1 \tag{50}$$

that

$$\beta(n,1) = Z_1^n / n!, \quad n \ge 0 \tag{51}$$

and that

$$\beta(n,k) = \beta(n,n), \quad k \ge n \ge 0. \tag{52}$$

The remaining elements of the array $\beta(n,k)$ are generated by the
recurrence relation

$$\beta(n,k) = \sum_{j=0}^{[n/k]} \beta(n - jk,\, k-1)\, \frac{Z_k^j}{j!} \tag{53}$$

where $[n/k]$ denotes the greatest integer $\le n/k$. Hence the
elements in column $k$ of the array $\beta(n,k)$ are just linear
combinations of certain elements in column $k-1$. Equation (53) is
justified by observing that the nonnegative integral solutions of
$a_1 + 2a_2 + \cdots + k a_k = n$ may be partitioned according to the
possible values $j = 0, 1, \ldots, [n/k]$ of $a_k$, the terms in (53)
corresponding to those $[n/k] + 1$ possible values of $a_k$. Values of
$\beta(n,k)$ for $0 \le n \le 3$ and $1 \le k \le 3$ are listed in Table I.
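
The recursion can be sketched in a few lines of Python. This is an illustrative reading of (46) and (49) through (53), not the authors' program: the function names and the list-based layout of the array are assumptions, column 1 is filled from (51), and each later column is filled from the previous one using (53). Exact rational arithmetic is used so that the result can be compared with the explicit expansions of the low-order moments.

```python
from fractions import Fraction
from math import factorial


def beta_table(Z, N):
    """Return beta[n][k] for 0 <= n <= N and 1 <= k <= N.

    Z[r] holds Z_r for r = 1..N (Z[0] is unused). Column 0 of the
    returned table is padding and is never read.
    """
    beta = [[Fraction(0)] * (N + 1) for _ in range(N + 1)]
    for n in range(N + 1):
        beta[n][1] = Fraction(Z[1]) ** n / factorial(n)  # column 1, eq. (51)
    for k in range(2, N + 1):
        for n in range(N + 1):
            # eq. (53): split on the possible values j of a_k
            beta[n][k] = sum(
                (beta[n - j * k][k - 1] * Fraction(Z[k]) ** j / factorial(j)
                 for j in range(n // k + 1)),
                Fraction(0),
            )
    return beta


def moments_from_semi_invariants(chi, N):
    """Moments alpha_1..alpha_N from semi-invariants chi[1..N] (N >= 1)."""
    Z = [Fraction(0)] + [
        Fraction(chi[r]) / (factorial(r) * 2 ** r) for r in range(1, N + 1)
    ]  # change of variables, eq. (46)
    beta = beta_table(Z, N)
    return [factorial(n) * 2 ** n * beta[n][n] for n in range(1, N + 1)]  # eq. (49)


# Illustrative input: semi-invariants 2^(n-1) * (n-1)! for n = 1..4.
chi = [None, 1, 2, 8, 48]
alphas = moments_from_semi_invariants(chi, 4)
```

Filling the table takes on the order of $N^2 \log N$ multiplications, since column $k$ needs at most $[n/k] + 1$ terms per entry, whereas direct enumeration grows with the number of integer partitions of $n$.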

V.  Conclusion 

The expressions given in (27) and (28) provide a straightforward
solution to the problem of computing the semi-invariants of
the interclass Mahalanobis distance. As indicated in Section I,
the semi-invariants are useful descriptions of the underlying
interclass distance pdf.

Although the semi-invariants do not in general have the familiar
“physical” interpretation possessed by the moments (e.g., spread,
skew, and kurtosis), the distributive property of the
semi-invariants resulted in a computational procedure involving only the
mean vectors and inverse covariance matrices of two populations,
without the need for the simultaneous diagonalization required to
obtain the moments [17]. The algorithm given in Section IV
provides a rather simple iterative technique for computing the
moments once the semi-invariants have been obtained via (27)
and (28).

The semi-invariants were considerably simplified in the special
cases discussed in Section III-C. In particular, the semi-invariants
of the intraclass Mahalanobis distance were shown to be
dependent only on the order of the semi-invariants and on the
dimensionality of the vector populations.